Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (1): 1-7.DOI: 10.11772/j.issn.1001-9081.2020061009

Special Issue: The 8th China Conference on Data Mining (CCDM 2020)

• China Conference on Data Mining 2020 (CCDM 2020) •

Reward highway network based global credit assignment algorithm in multi-agent reinforcement learning

YAO Xinghu1,2,3, TAN Xiaoyang1,2,3   

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing Jiangsu 211106, China;
    2. MIIT Key Laboratory of Pattern Analysis and Machine Intelligence(Nanjing University of Aeronautics and Astronautics), Nanjing Jiangsu 211106, China;
    3. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University of Aeronautics and Astronautics, Nanjing Jiangsu 211106, China
  • Received: 2020-05-31; Revised: 2020-09-24; Online: 2021-01-10; Published: 2020-11-12
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61976115, 61672280, 61732006), the Equipment Pre-research Fund (6140312020413), the Artificial Intelligence+ Project of Nanjing University of Aeronautics and Astronautics (56XZA18009), the Project of Pre-research on Military Shared Information System Equipment (315025305), and the Graduate Innovation Foundation of Nanjing University of Aeronautics and Astronautics (Kfjj20191608).

  • Corresponding author: TAN Xiaoyang
  • About the authors: YAO Xinghu, born in Jining, Shandong in 1996, M. S. candidate, CCF member. His research interests include deep reinforcement learning and multi-agent reinforcement learning. TAN Xiaoyang, born in Nanjing, Jiangsu in 1971, Ph. D., professor, CCF member. His research interests include machine learning and deep reinforcement learning.

Abstract: To address the exponential growth of the joint action space as the number of agents increases in multi-agent systems, the "centralized training with decentralized execution" framework was adopted to avoid the curse of dimensionality of the joint action space and to reduce the optimization cost of the algorithm. To address the problem that, in many multi-agent reinforcement learning scenarios, the environment only provides a global reward for the joint behavior of all agents, a new global credit assignment mechanism, the Reward HighWay Network (RHWNet), was proposed. By introducing a reward highway connection into the global reward assignment mechanism of the original algorithm, the value function of each agent was connected directly to the global reward, so that each agent was able to consider both the global reward signal and the reward it actually receives when selecting its policy. Firstly, during training, the agents were coordinated through a centralized value function structure, which also served to assign the global reward. Then, the reward highway connection was introduced into this centralized value function structure to assist global reward assignment, thereby forming the reward highway network. Finally, during execution, the policy of each agent depended only on its own value function. Experimental results on StarCraft Multi-Agent Challenge (SMAC) micromanagement scenarios show that the proposed reward highway network improves the test win rate by more than 20% on four complex maps compared with the state-of-the-art Counterfactual Multi-Agent policy gradient (COMA) and QMIX algorithms. More importantly, on the 3s5z and 3s6z scenarios, which contain many agents of different types, the proposed network achieves better results while requiring only 30% of the samples needed by algorithms such as COMA and QMIX.
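
To make the mechanism described above concrete, the following is a minimal sketch (not the authors' implementation) of how a reward highway connection could be attached to a QMIX-style centralized mixing network in PyTorch. The layer sizes, the sigmoid gating form, and all names such as RewardHighwayMixer are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, assuming a QMIX-style central value function with an added
# "reward highway" skip term that ties each agent's utility Q_i directly to
# the joint value trained on the global reward. All hyperparameters and the
# gating form are assumptions for illustration.
import torch
import torch.nn as nn

class RewardHighwayMixer(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        # Hypernetworks that generate the mixing weights from the global state
        # (made non-negative in forward, as in QMIX, to keep monotonicity).
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))
        # Highway gate: how much of each agent's utility is passed straight
        # through to the joint value, bypassing the mixing layers.
        self.highway_gate = nn.Sequential(nn.Linear(state_dim, n_agents),
                                          nn.Sigmoid())

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents) individual values; state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, -1)
        b1 = self.hyper_b1(state).view(b, 1, -1)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, -1, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        mixed = torch.bmm(hidden, w2) + b2              # (batch, 1, 1)
        # Reward highway: a gated direct path from each agent's value to the
        # joint value, so the global reward signal reaches every Q_i directly.
        gate = self.highway_gate(state)                 # (batch, n_agents)
        highway = (gate * agent_qs).sum(dim=1, keepdim=True)
        return mixed.view(b, 1) + highway               # Q_tot, trained on the global reward
```

In this sketch the highway term is a non-negative gated sum of the individual utilities, so the joint value remains monotone in each Q_i and the TD error computed from the global reward back-propagates directly into every agent's value function, which is the intuition behind connecting each agent to the global reward signal.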

Key words: deep learning, deep reinforcement learning, multi-agent reinforcement learning, multi-agent system, global credit assignment

