Multi-agent reinforcement learning based on attentional message sharing

doi:10.11772/j.issn.1001-9081.2021122169

Journal of Computer Applications, 2022, Vol. 42, Issue (11): 3346-3353.

Special topic: The 9th CCF Conference on Big Data (CCF Bigdata 2021)

Rong ZANG¹, Li WANG¹, Tengfei SHI²

¹ College of Data Science, Taiyuan University of Technology, Jinzhong, Shanxi 030600, China
² North Automatic Control Technology Institute, Taiyuan, Shanxi 030006, China

Received: 2021-12-21 Revised: 2022-01-14 Accepted: 2022-01-24 Online: 2022-03-04 Published: 2022-11-10
Corresponding author: Li WANG (wangli@tyut.edu.cn)
About the authors:
ZANG Rong, born in 1997 in Taiyuan, Shanxi, M.S. candidate. His research interests include reinforcement learning and multi-agent systems.
WANG Li, born in 1971 in Taiyuan, Shanxi, Ph.D., professor, CCF senior member. Her research interests include data mining, artificial intelligence, and machine learning.
SHI Tengfei, born in 1990 in Jincheng, Shanxi, M.S., engineer, CCF member. His research interests include deep reinforcement learning.
Supported by: National Natural Science Foundation of China (61872260)

Abstract:

Communication is an important way to achieve effective cooperation among multiple agents in a non-omniscient environment. When there are a large number of agents, redundant messages may be generated in the communication process. To handle the communication messages effectively, a multi-agent reinforcement learning algorithm based on attentional message sharing was proposed, called AMSAC (Attentional Message Sharing multi-agent Actor-Critic). Firstly, a message sharing network was built for effective communication among agents, and information sharing was achieved through message reading and writing by the agents, thus solving the problem of lack of communication among agents in non-omniscient environments with complex tasks. Secondly, in the message sharing network, the communication messages were processed adaptively by the attentional message sharing mechanism, so that messages from different agents were weighted according to their importance, solving the problem that large-scale multi-agent systems cannot effectively identify and utilize messages during communication. Thirdly, in the centralized Critic network, the Native Critic was used to update the Actor network parameters according to the Temporal Difference (TD) advantage policy gradient, so that the action values of agents were evaluated effectively. Finally, during execution, each agent's distributed Actor network made decisions based on its own observations and the messages from the message sharing network. Experimental results in the StarCraft Multi-Agent Challenge (SMAC) environment show that, compared with Native Actor-Critic (Native AC), Game Abstraction Communication (GA-Comm) and other multi-agent reinforcement learning methods, AMSAC improves the average win rate by 4 to 32 percentage points in four different scenarios. AMSAC's attentional message sharing mechanism provides a reasonable solution for processing communication messages among agents in a multi-agent system, and has broad application prospects in both transportation hub control and unmanned aerial vehicle collaboration.
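
The abstract does not state how the attentional message sharing mechanism scores messages, so the following is only a minimal sketch under the assumption of scaled dot-product attention. The names `attend_messages`, `query`, `keys` and `messages` are hypothetical and not taken from the paper; the sketch merely illustrates how a reading agent could weight messages from different agents before aggregating them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_messages(query, keys, messages):
    """Weight each sender's message by scaled dot-product attention.

    query    -- vector of the reading agent (hypothetical encoding)
    keys     -- one key vector per sending agent
    messages -- one message vector per sending agent
    Returns (weights, aggregated_message).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(messages[0])
    aggregated = [
        sum(w * msg[d] for w, msg in zip(weights, messages)) for d in range(dim)
    ]
    return weights, aggregated


# Toy example: the reader's query is most similar to the first sender's key,
# so the first message receives the largest weight.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
messages = [[1.0, 1.0], [0.0, 5.0], [9.0, 9.0]]
weights, aggregated = attend_messages(query, keys, messages)
```

In this toy setup the weights sum to one and decrease from the first sender to the third, which is the "differentiated emphasis" the abstract describes; a learned model would produce the query, key and message vectors with trained networks rather than fixed lists.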

Key words: multi-agent system, agent cooperation, deep reinforcement learning, agent communication, attention mechanism, policy gradient
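
The TD advantage policy gradient mentioned in the abstract can be illustrated with a small sketch. It assumes a one-step TD error as the advantage estimate and a softmax policy over discrete actions; the function names, the learning rate and the toy numbers are hypothetical, and the paper's actual networks and update rule may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def td_advantage(reward, value, next_value, gamma=0.99, done=False):
    """One-step TD error r + gamma * V(s') - V(s), used as the advantage."""
    target = reward + (0.0 if done else gamma * next_value)
    return target - value


def actor_step(logits, action, advantage, lr=0.1):
    """Gradient-ascent step on advantage * log pi(action) w.r.t. the logits.

    For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi(i).
    """
    probs = softmax(logits)
    grad_log_pi = [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]
    return [x + lr * advantage * g for x, g in zip(logits, grad_log_pi)]


# Toy transition: the critic's values give a positive TD error,
# so the probability of the chosen action (index 1) should increase.
logits = [0.0, 0.0, 0.0]
adv = td_advantage(reward=1.0, value=0.5, next_value=0.6, gamma=0.9)
new_logits = actor_step(logits, action=1, advantage=adv)
```

A positive TD error raises the logit of the action that was taken and lowers the others; a negative one does the opposite, which is how the critic's evaluation steers the actor.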

CLC Number: TP181

Rong ZANG, Li WANG, Tengfei SHI. Multi-agent reinforcement learning based on attentional message sharing[J]. Journal of Computer Applications, 2022, 42(11): 3346-3353.

Figures/Tables: 8

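Map	Ally units	Enemy units
2s3z	2 Stalkers and 3 Zealots	2 Stalkers and 3 Zealots
1c3s5z	1 Colossus, 3 Stalkers and 5 Zealots	1 Colossus, 3 Stalkers and 5 Zealots
3s5z	3 Stalkers and 5 Zealots	3 Stalkers and 5 Zealots
8m	8 Marines	8 Marines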

Map	AMSAC	MSAC	Native AC	COMA	CommNet	GA-Comm
2s3z	47.02 (41.60~54.74)	29.82 (20.35~32.37)	30.61 (21.18~39.83)	15.19 (12.95~17.68)	4.95 (3.78~6.44)	7.34 (4.34~15.25)
1c3s5z	41.96 (32.76~46.88)	28.25 (21.67~30.71)	26.72 (21.88~31.00)	15.29 (8.38~22.01)	0.23 (0.00~0.98)	0.22 (0.00~0.72)
3s5z	4.21 (3.56~5.15)	1.17 (0.34~2.39)	0.76 (0.09~2.05)	0.08 (0.00~0.11)	0.01 (0.00~0.03)	0.01 (0.00~0.02)
8m	85.06 (78.68~86.75)	90.45 (89.74~91.22)	89.51 (88.59~90.70)	84.51 (83.54~85.07)	24.57 (8.71~54.24)	45.91 (28.45~54.32)

Map	AMSAC	MSAC	Native AC	COMA	CommNet	GA-Comm
2s3z	92.19	90.63	87.50	56.25	34.38	50.00
1c3s5z	100.00	100.00	100.00	78.13	6.25	21.88
3s5z	46.88	31.25	21.88	6.25	3.13	3.13
8m	100.00	100.00	100.00	100.00	100.00	100.00
