Multi‑agent reinforcement learning based on attentional message sharing

doi:10.11772/j.issn.1001-9081.2021122169

Journal of Computer Applications, 2022, Vol. 42, Issue (11): 3346-3353. DOI: 10.11772/j.issn.1001-9081.2021122169

Special Issue: The 9th CCF Big Data Conference (CCF Bigdata 2021)

Rong ZANG¹, Li WANG¹, Tengfei SHI²

1. College of Data Science, Taiyuan University of Technology, Jinzhong, Shanxi 030600, China
2. North Automatic Control Technology Institute, Taiyuan, Shanxi 030006, China

Received: 2021-12-21 Revised: 2022-01-14 Accepted: 2022-01-24 Online: 2022-03-04 Published: 2022-11-10
Contact: Li WANG
About the authors: ZANG Rong, born in 1997, M.S. candidate. His research interests include reinforcement learning and multi-agent systems.
WANG Li, born in 1971, Ph.D., professor. Her research interests include data mining, artificial intelligence and machine learning.
SHI Tengfei, born in 1990, M.S., engineer. His research interests include deep reinforcement learning.
Supported by:
National Natural Science Foundation of China (61872260)

基于注意力消息共享的多智能体强化学习

臧嵘¹, 王莉¹, 史腾飞²

1. 太原理工大学大数据学院，山西晋中 030600
2. 北方自动控制技术研究所，太原 030006

通讯作者: 王莉
作者简介:臧嵘（1997—），男，山西太原人，硕士研究生，主要研究方向：强化学习、多智能体系统
王莉（1971—），女，山西太原人，教授，博士，CCF高级会员，主要研究方向：数据挖掘、人工智能、机器学习 wangli@tyut.edu.cn
史腾飞（1990—），男，山西晋城人，工程师，硕士，CCF会员，主要研究方向：深度强化学习。

Abstract:

Communication is an important way to achieve effective cooperation among multiple agents in a non-omniscient environment. When the number of agents is large, the communication process may generate redundant messages. To handle communication messages effectively, a multi-agent reinforcement learning algorithm based on attentional message sharing was proposed, called AMSAC (Attentional Message Sharing multi-agent Actor-Critic). Firstly, a message sharing network was built for effective communication among agents, and information was shared through message reading and writing by the agents, which addressed the lack of communication among agents in non-omniscient environments with complex tasks. Then, in the message sharing network, communication messages were processed adaptively by the attentional message sharing mechanism, which weighted the messages from different agents by importance, so as to address the inability of larger-scale multi-agent systems to identify and utilize messages effectively during communication. Moreover, in the centralized Critic network, the Native Critic was used to update the Actor network parameters according to the Temporal Difference (TD) advantage policy gradient, so that the action values of agents were evaluated effectively. Finally, during execution, each agent's distributed Actor network made decisions based on its own observations and the messages from the message sharing network. Experimental results in the StarCraft Multi-Agent Challenge (SMAC) environment show that, compared with Native Actor-Critic (Native AC), Game Abstraction Communication (GA-Comm) and other multi-agent reinforcement learning methods, AMSAC improves the average win rate by 4 to 32 percentage points in four different scenarios. AMSAC's attentional message sharing mechanism provides a reasonable solution for processing communication messages among agents in a multi-agent system, and has broad application prospects in both transportation hub control and unmanned aerial vehicle collaboration.

Key words: multi-agent system, agent cooperation, deep reinforcement learning, agent communication, attention mechanism, policy gradient
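
The abstract names two building blocks: an attention-weighted read of the messages written by other agents, and a TD advantage used in the policy gradient. The sketch below illustrates both in plain Python. It is not the authors' implementation: the abstract does not give the network structure, so the scaled dot-product scoring, the use of raw messages as both keys and values, and the function names `read_messages` and `td_advantage` are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def read_messages(query: Vector, messages: List[Vector]) -> Vector:
    """Attention-weighted read (hypothetical helper).

    Each written message is scored against the reading agent's query,
    the scores are normalized with softmax, and the messages are
    averaged with those weights. Assumes scaled dot-product scoring
    and that query and messages share the same dimension.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, m) / scale for m in messages])
    dim = len(messages[0])
    return [sum(w * m[k] for w, m in zip(weights, messages)) for k in range(dim)]


def td_advantage(reward: float, value: float, next_value: float,
                 gamma: float = 0.99, done: bool = False) -> float:
    """One-step TD advantage: r + gamma * V(s') - V(s).

    The bootstrap term is dropped when the episode has ended.
    """
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


# Usage: one agent reads three messages, then scores a transition.
query = [1.0, 0.0]
messages = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(read_messages(query, messages))
print(td_advantage(reward=1.0, value=0.5, next_value=0.8))
```

In a learned system the query, keys and values would normally come from trainable projections of each agent's hidden state. They are omitted here so that the weighting step itself stays visible.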

摘要：

通信是非全知环境中多智能体间实现有效合作的重要途径，当智能体数量较多时，通信过程会产生冗余消息。为有效处理通信消息，提出一种基于注意力消息共享的多智能体强化学习算法AMSAC。首先，在智能体间搭建用于有效沟通的消息共享网络，智能体通过消息读取和写入完成信息共享，解决智能体在非全知、任务复杂场景下缺乏沟通的问题；其次，在消息共享网络中，通过注意力消息共享机制对通信消息进行自适应处理，有侧重地处理来自不同智能体的消息，解决较大规模多智能体系统在通信过程中无法有效识别消息并利用的问题；然后，在集中式Critic网络中，使用Native Critic依据时序差分（TD）优势策略梯度更新Actor网络参数，使智能体的动作价值得到有效评判；最后，在执行期间，智能体分布式Actor网络根据自身观测和消息共享网络的信息进行决策。在星际争霸Ⅱ多智能体挑战赛（SMAC）环境中进行实验，结果表明，与朴素Actor‑Critic（Native AC）、博弈抽象通信（GA‑Comm）等多智能体强化学习方法相比，AMSAC在四个不同场景下的平均胜率提升了4~32个百分点。AMSAC的注意力消息共享机制为处理多智能体系统中智能体间的通信消息提供了合理方案，在交通枢纽控制和无人机协同领域都具备广泛的应用前景。

关键词: 多智能体系统, 智能体协同, 深度强化学习, 智能体通信, 注意力机制, 策略梯度

CLC Number:

TP181

Rong ZANG, Li WANG, Tengfei SHI. Multi‑agent reinforcement learning based on attentional message sharing[J]. Journal of Computer Applications, 2022, 42(11): 3346-3353.

臧嵘, 王莉, 史腾飞. 基于注意力消息共享的多智能体强化学习[J]. 计算机应用, 2022, 42(11): 3346-3353.

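Map	Allied units	Enemy units
2s3z	2 Stalkers and 3 Zealots	2 Stalkers and 3 Zealots
1c3s5z	1 Colossus, 3 Stalkers and 5 Zealots	1 Colossus, 3 Stalkers and 5 Zealots
3s5z	3 Stalkers and 5 Zealots	3 Stalkers and 5 Zealots
8m	8 Marines	8 Marines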

Map	AMSAC	MSAC	Native AC	COMA	CommNet	GA‑Comm
2s3z	47.02 (41.60~54.74)	29.82 (20.35~32.37)	30.61 (21.18~39.83)	15.19 (12.95~17.68)	4.95 (3.78~6.44)	7.34 (4.34~15.25)
1c3s5z	41.96 (32.76~46.88)	28.25 (21.67~30.71)	26.72 (21.88~31.00)	15.29 (8.38~22.01)	0.23 (0.00~0.98)	0.22 (0.00~0.72)
3s5z	4.21 (3.56~5.15)	1.17 (0.34~2.39)	0.76 (0.09~2.05)	0.08 (0.00~0.11)	0.01 (0.00~0.03)	0.01 (0.00~0.02)
8m	85.06 (78.68~86.75)	90.45 (89.74~91.22)	89.51 (88.59~90.70)	84.51 (83.54~85.07)	24.57 (8.71~54.24)	45.91 (28.45~54.32)

Map	AMSAC	MSAC	Native AC	COMA	CommNet	GA‑Comm
2s3z	92.19	90.63	87.50	56.25	34.38	50.00
1c3s5z	100.00	100.00	100.00	78.13	6.25	21.88
3s5z	46.88	31.25	21.88	6.25	3.13	3.13
8m	100.00	100.00	100.00	100.00	100.00	100.00
