基于奖励高速路网络的多智能体强化学习中的全局信用分配算法

doi:10.11772/j.issn.1001-9081.2020061009

计算机应用 ›› 2021, Vol. 41 ›› Issue (1): 1-7.DOI: 10.11772/j.issn.1001-9081.2020061009

所属专题：第八届中国数据挖掘会议(CCDM 2020)

• 第八届中国数据挖掘会议(CCDM 2020) • 下一篇

基于奖励高速路网络的多智能体强化学习中的全局信用分配算法

姚兴虎^1,2,3, 谭晓阳^1,2,3

1. 南京航空航天大学计算机科学与技术学院, 南京 211106;
2. 模式分析与机器智能工业和信息化部重点实验室(南京航空航天大学), 南京 211106;
3. 南京航空航天大学软件新技术与产业化协同创新中心, 南京 211106

收稿日期:2020-05-31 修回日期:2020-09-24 发布日期:2020-11-12 出版日期:2021-01-10
通讯作者: 谭晓阳
作者简介:姚兴虎(1996-),男,山东济宁人,硕士研究生,CCF会员,主要研究方向:深度强化学习、多智能体强化学习;谭晓阳(1971-),男,江苏南京人,教授,博士,CCF会员,主要研究方向:机器学习、深度强化学习。
基金资助:
国家自然科学基金资助项目（61976115，61672280， 61732006）；装备预研基金资助项目（6140312020413）；南京航空航天大学人工智能+项目（56XZA18009）；全军共用信息系统装备预研项目（315025305）；南京航空航天大学研究生创新基金资助项目（Kfjj20191608）。

Reward highway network based global credit assignment algorithm in multi-agent reinforcement learning

YAO Xinghu^1,2,3, TAN Xiaoyang^1,2,3

1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing Jiangsu 211106, China;
2. MIIT Key Laboratory of Pattern Analysis and Machine Intelligence(Nanjing University of Aeronautics and Astronautics), Nanjing Jiangsu 211106, China;
3. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University of Aeronautics and Astronautics, Nanjing Jiangsu 211106, China

Received:2020-05-31 Revised:2020-09-24 Online:2020-11-12 Published:2021-01-10
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61976115,61672280, 61732006), the Equipment Pre-research Fund (6140312020413), the Artificial Intelligence+ Project of Nanjing University of Aeronautics and Astronautics (56XZA18009), the Project of Pre-research on Military Shared Information System Equipment (315025305), the Graduate Innovation Foundation of Nanjing University of Aeronautics and Astronautics (Kfjj20191608).

摘要/Abstract

摘要： 针对多智能体系统中联合动作空间随智能体数量的增加而产生的指数爆炸的问题，采用“中心训练-分散执行”的框架来避免联合动作空间的维数灾难并降低算法的优化代价。针对在众多的多智能体强化学习场景下，环境仅给出所有智能体的联合行为所对应的全局奖励这一问题，提出一种新的全局信用分配机制——奖励高速路网络（RHWNet）。通过在原有算法的奖励分配机制上引入奖励高速路连接，将每个智能体的值函数与全局奖励直接建立联系，进而使得每个智能体在进行策略选择时能够综合考虑全局的奖励信号与其自身实际分得的奖励值。首先，在训练过程中，通过中心化的值函数结构对每个智能体进行协调；同时，这一中心化的结构也能起到全局奖励分配的作用；然后，在中心值函数结构中引入奖励高速路链接来辅助进行全局奖励分配，从而构建出奖励高速路网络；之后，在执行阶段，每个智能体的策略仅仅依赖于其自身的值函数。在星际争霸多智能体挑战的微操作场景中的实验结果表明，相比当前较先进的反直觉的策略梯度（Coma）算法和单调Q值函数分解（QMIX）算法，该网络所提出的奖励高速路在4个复杂的地图上的测试胜率提升超过20%。更重要的是，在智能体数量较多且种类不同的3s5z和3s6z场景中，该网络在所需样本数量为QMIX和Coma等算法的30%的情况下便能取得更好的结果。

关键词: 深度学习, 深度强化学习, 多智能体强化学习, 多智能体系统, 全局信用分配

Abstract: For the problem of exponential explosion of joint action space with the increase of the number of agents in multi-agent systems, the "central training-decentralized execution" framework was adopted to solve the curse of dimensionality of joint action space and reduce the optimization cost of the algorithm. A new global credit assignment mechanism, Reward HighWay Network (RHWNet), was proposed to solve the problem that only the global reward corresponding to the joint behavior of all agents was given by the environment in multiple multi-agent reinforcement learning scenarios. By introducing the reward highway connection in the global reward assignment mechanism of the original algorithm, the value function of each agent was directly connected with the global reward, so that each agent was able to consider both the global reward signal and its actual reward value when making strategy selection. Firstly, in the training process, each agent was coordinated through a centralized value function structure. At the same time, this centralized structure was also able to play a role in global reward assignment. Then, the reward highway connection was introduced in the central value function structure to assist the global reward assignment, thus establishing the reward highway network. Then, in the execution phase, each agent's strategy depended only on its own value function. Experimental results on the StarCraft Multi-Agent Challenge (SMAC) microoperation scenarios show that the proposed reward highway network achieves a performance improvement of more than 20% in testing winning rate on four complex maps compared to the advanced Counterfactual multi-agent policy gradient (Coma) and QMIX algorithms. More importantly, in 3s5z and 3s6z scenarios with a large number and different types of agents, the proposed network can achieve better results when the required number of samples is only 30% of algorithms such as Coma and QMIX.

Key words: deep learning, deep reinforcement learning, multi-agent reinforcement learning, multi-agent system, global credit assignment

中图分类号:

TP181

姚兴虎, 谭晓阳. 基于奖励高速路网络的多智能体强化学习中的全局信用分配算法[J]. 计算机应用, 2021, 41(1): 1-7.

YAO Xinghu, TAN Xiaoyang. Reward highway network based global credit assignment algorithm in multi-agent reinforcement learning[J]. Journal of Computer Applications, 2021, 41(1): 1-7.

参考文献

[1] MNIH V,KAVUKCUOGLU K,SILVER D,et al. Human-level control through deep reinforcement learning[J]. Nature,2015,518(7540):529-533.
[2] 刘全, 翟建伟, 章宗长, 等. 深度强化学习综述[J]. 计算机学报, 2018,41(1):1-27.(LIU Q,ZHAI J W,ZHANG Z Z,et al. A survey on deep reinforcement learning[J]. Chinese Journal of Computers,2018,41(1):1-27.)
[3] SCHULMAN J,WOLSKI F,DHARIWAL P,et al. Proximal policy optimization algorithms[EB/OL].[2020-09-03]. https://arxiv.org/pdf/1707.06347.pdf.
[4] 殷昌盛, 杨若鹏, 朱巍, 等. 多智能体分层强化学习综述[J]. 智能系统学报, 2020, 15(4):646-655.(YIN C S,YANG R P,ZHU W, et al. A survey on multi-agent hierarchical reinforcement learning[J]. CAAI Transactions on Intelligent Systems,2020, 15(4):646-655.)
[5] 孙长银, 穆朝絮. 多智能体深度强化学习的若干关键科学问题[J]. 自动化学报,2020,46(7):1301-1312.(SUN C Y,MU C X. Important scientific problems of multi-agent deep reinforcement learning[J]. Acta Automatica Sinica,2020,46(7):1301-1312.)
[6] 王冲, 景宁, 李军, 等. 一种基于多agent强化学习的多星协同任务规划算法[J]. 国防科技大学学报,2011,33(1):53-58. (WANG C,JING N,LI J,et al. An algorithm of cooperative multiple satellites mission planning based on multi-agent reinforcement learning[J]. Journal of National University of Defense Technology,2011,33(1):53-58.)
[7] VAN DER POL E, OLIEHOEK F A. Coordinated deep reinforcement learners for traffic light control[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook,NY:Curran Associates Inc.,2016:1-8.
[8] JADERBERG M,CZARNECKI W M,DUNNING I,et al. Humanlevel performance in 3D multiplayer games with population-based reinforcement learning[J]. Science,2019,364(6443):859-865.
[9] NAIR R,TAMBE M,YOKOO M,et al. Taming decentralized POMDPs:towards efficient policy computation for multiagent settings[C]//Proceedings of the 18th International Joint Conference on Artificial Intelligence. San Francisco:Morgan Kaufmann Publishers Inc.,2003:705-711.
[10] LAURENT G J,MATIGNON L,LE FORT-PIAT N. The world of independent learners is not Markovian[J]. International Journal of Knowledge-based and Intelligent Engineering Systems,2011,15(1):55-64.
[11] LOWE R,WU Y,TAMAR A,et al. Multi-agent actor-critic for mixed cooperative-competitive environments[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook,NY:Curran Associates Inc.,2017:6382-6393.
[12] FOERSTER J N, FARQUHAR G, AFOURAS T, et al. Counterfactual multi-agent policy gradients[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. Palo Alto, CA:AAAI,2018:2974-2982.
[13] SUNEHAG P, LEVER G, GRUSLYS A, et al. Valuedecomposition networks for cooperative multi-agent learning based on team reward[C]//Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems. Richland,SC:International Foundation for Autonomous Agents and Multiagent Systems,2018:2085-2087.
[14] RASHID T,SAMVELYAN M,SCHROEDER C,et al. QMIX:monotonic value function factorisation for deep multi-agent reinforcement learning[C]//Proceedings of the 35th International Conference on Machine Learning. New York:JMLR. org,2018:4295-4304.
[15] SON K,KIM D,KANG W J,et al. QTRAN:learning to factorize with transformation for cooperative multi-agent reinforcement learning[C]//Proceedings of the 36th International Conference on Machine Learning. New York:JMLR. org,2019:5887-5896.
[16] YAO X,WEN C,WANG Y,et al. SMIX (λ):enhancing centralized value functions for cooperative multi-agent reinforcement learning[EB/OL].[2020-09-03]. https://arxiv.org/pdf/1911.04094.pdf.
[17] SUKHBAATAR S,SZLAM A,FERGUS R. Learning multiagent communication with backpropagation[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook,NY:Curran Associates Inc.,2016:2252-2260.
[18] FOERSTER J,ASSAEL Y M,DE FREITAS N,et al. Learning to communicate with deep multi-agent reinforcement learning[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook, NY:Curran Associates Inc.,2016:2145-2153.
[19] IQBAL S, SHA F. Actor-attention-critic for multi-agent reinforcement learning[C]//Proceedings of the 36th International Conference on Machine Learning. New York:JMLR. org,2019:2961-2970.
[20] JIANG J,LU Z. Learning attentional communication for multiagent cooperation[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook, NY:Curran Associates Inc.,2018:7265-7275.
[21] JIANG J, DUN C, HUANG T, et al. Graph convolutional reinforcement learning[EB/OL].[2020-09-03]. https://arxiv.org/pdf/1810.09202.pdf.
[22] LIU Y,WANG W,HU Y,et al. Multi-agent game abstraction via graph attention neural network[C]//Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto,CA:AAAI, 2020:7211-7218.
[23] HAUSKNECHT M,STONE P. Deep recurrent Q-learning for partially observable MDPs[C]//Proceedings of the 2015 AAAI Fall Symposium Series. Palo Alto,CA:AAAI,2015:29-37.
[24] OLIEHOEK F A, AMATO C. A Concise Introduction to Decentralized POMDPs[M]. Cham:Springer,2016:11-30.
[25] CHO K, VAN MERRIËNBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA:Associations for Computational Linguistics, 2014:1724-1734.
[26] HOCHREITER S,SCHMIDHUBER J. Long short-term memory[J]. Neural Computation,1997,9(8):1735-1780.
[27] HE K,ZHANG X,REN S,et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2016:770-778.
[28] SAMVELYAN M,RASHID T,DE WITT C S,et al. The StarCraft multi-agent challenge[C]//Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems. Richland,SC:International Foundation for Autonomous Agents and Multiagent Systems,2019:2186-2188.
[29] HAHNLOSER R H R,SARPESHKAR R,MAHOWALD M A,et al. Digital selection and analogue amplification coexist in a cortexinspired silicon circuit[J]. Nature,2000,405(6789):947-951.

基于奖励高速路网络的多智能体强化学习中的全局信用分配算法

Reward highway network based global credit assignment algorithm in multi-agent reinforcement learning

PDF

可视化

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 15

编辑推荐

Metrics

[1]	李顺勇, 李师毅, 胥瑞, 赵兴旺. 基于自注意力融合的不完整多视图聚类算法[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2696-2703.
[2]	潘烨新, 杨哲. 基于多级特征双向融合的小目标检测优化模型[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2871-2877.
[3]	黄云川, 江永全, 黄骏涛, 杨燕. 基于元图同构网络的分子毒性预测[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2964-2969.
[4]	秦璟, 秦志光, 李发礼, 彭悦恒. 基于概率稀疏自注意力神经网络的重性抑郁疾患诊断[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2970-2974.
[5]	王熙源, 张战成, 徐少康, 张宝成, 罗晓清, 胡伏原. 面向手术导航3D/2D配准的无监督跨域迁移网络[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2911-2918.
[6]	周毅, 高华, 田永谌. 基于裁剪优化和策略指导的近端策略优化算法[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2334-2341.
[7]	刘禹含, 吉根林, 张红苹. 基于骨架图与混合注意力的视频行人异常检测方法[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2551-2557.
[8]	顾焰杰, 张英俊, 刘晓倩, 周围, 孙威. 基于时空多图融合的交通流量预测[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2618-2625.
[9]	石乾宏, 杨燕, 江永全, 欧阳小草, 范武波, 陈强, 姜涛, 李媛. 面向空气质量预测的多粒度突变拟合网络[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2643-2650.
[10]	吴筝, 程志友, 汪真天, 汪传建, 王胜, 许辉. 基于深度学习的患者麻醉复苏过程中的头部运动幅度分类方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2258-2263.
[11]	李欢欢, 黄添强, 丁雪梅, 罗海峰, 黄丽清. 基于多尺度时空图卷积网络的交通出行需求预测[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2065-2072.
[12]	张郅, 李欣, 叶乃夫, 胡凯茜. 基于暗知识保护的模型窃取防御技术DKP[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2080-2086.
[13]	马天, 席润韬, 吕佳豪, 曾奕杰, 杨嘉怡, 张杰慧. 基于深度强化学习的移动机器人三维路径规划方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2055-2064.
[14]	赵亦群, 张志禹, 董雪. 基于密集残差物理信息神经网络的各向异性旅行时计算方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2310-2318.
[15]	徐松, 张文博, 王一帆. 基于时空信息的轻量视频显著性目标检测网络[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2192-2199.