Multi-agent reinforcement learning based on attentional message sharing

Journal of Computer Applications, 2022, Vol. 42, Issue (11): 3346-3353. DOI: 10.11772/j.issn.1001-9081.2021122169

• The 9th CCF Conference on Big Data •

Rong ZANG¹, Li WANG¹, Tengfei SHI²

1. College of Data Science, Taiyuan University of Technology, Jinzhong Shanxi 030600, China
2. North Automatic Control Technology Institute, Taiyuan Shanxi 030006, China

Received: 2021-12-21 Revised: 2022-01-14 Accepted: 2022-01-24 Online: 2022-03-04 Published: 2022-11-10
Contact: Li WANG (wangli@tyut.edu.cn)
About the authors:
ZANG Rong, born in 1997 in Taiyuan, Shanxi, M. S. candidate. His research interests include reinforcement learning and multi-agent systems.
WANG Li, born in 1971 in Taiyuan, Shanxi, Ph. D., professor, CCF senior member. Her research interests include data mining, artificial intelligence and machine learning.
SHI Tengfei, born in 1990 in Jincheng, Shanxi, M. S., engineer, CCF member. His research interests include deep reinforcement learning.
Supported by: National Natural Science Foundation of China (61872260)

Abstract:

Communication is an important way for multiple agents to cooperate effectively in a non-omniscient environment, but when the number of agents is large, the communication process generates redundant messages. To handle communication messages effectively, a multi-agent reinforcement learning algorithm based on attentional message sharing, called AMSAC (Attentional Message Sharing multi-agent Actor-Critic), was proposed. Firstly, a message sharing network was built among the agents, and information was shared through message reading and writing, which addressed the lack of communication among agents in non-omniscient environments with complex tasks. Secondly, within the message sharing network, communication messages were processed adaptively by an attentional message sharing mechanism that weighted messages from different agents by importance, which addressed the inability of larger-scale multi-agent systems to identify and use messages effectively during communication. Thirdly, in the centralized Critic network, the Native Critic was used to update the Actor network parameters according to the Temporal Difference (TD) advantage policy gradient, so that the action values of the agents were evaluated effectively. Finally, during execution, each agent's distributed Actor network made decisions based on its own observations and the information from the message sharing network. Experimental results in the StarCraft Multi-Agent Challenge (SMAC) environment show that, compared with multi-agent reinforcement learning methods such as Native Actor-Critic (Native AC) and Game Abstraction Communication (GA-Comm), AMSAC improves the average win rate by 4 to 32 percentage points in four different scenarios. The attentional message sharing mechanism of AMSAC provides a reasonable solution for processing communication messages among agents in a multi-agent system, and has broad application prospects in both transportation hub control and unmanned aerial vehicle collaboration.

Key words: multi-agent system, agent cooperation, deep reinforcement learning, agent communication, attention mechanism, policy gradient
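
The abstract does not give the exact form of the attention mechanism or of the TD advantage, so the following is only a minimal sketch under stated assumptions: message reading is modeled as scaled dot-product attention over the messages written by all agents, and the advantage is the one-step TD error. The names `attentional_read` and `td_advantage`, the vector sizes, and the discount `gamma` are hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attentional_read(query, keys, values):
    """Read one aggregated message for a single agent.

    ASSUMPTION: scaled dot-product attention. Each written message has a
    key (used for scoring) and a value (the content that is mixed).
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    message = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return message, weights


def td_advantage(reward, value, next_value, done, gamma=0.99):
    """One-step TD error r + gamma * V(s') - V(s), used as the advantage.

    ASSUMPTION: no bootstrapping from the next state once the episode ends.
    """
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


# Three agents have written messages; the reader's query matches agents 0 and 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
message, weights = attentional_read([2.0, 0.0], keys, values)
print(weights)
print(td_advantage(reward=1.0, value=0.5, next_value=0.6, done=False, gamma=0.9))
```

In this toy call the first and third messages receive equal weight and the second receives less, which is the sense in which messages from different agents are treated with different emphasis. In an actor-critic setup the advantage would scale the log-probability gradient of the chosen action.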

CLC number: TP181

Cite this article: Rong ZANG, Li WANG, Tengfei SHI. Multi-agent reinforcement learning based on attentional message sharing[J]. Journal of Computer Applications, 2022, 42(11): 3346-3353.


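Map	Allied units	Enemy units
2s3z	2 Stalkers and 3 Zealots	2 Stalkers and 3 Zealots
1c3s5z	1 Colossus, 3 Stalkers and 5 Zealots	1 Colossus, 3 Stalkers and 5 Zealots
3s5z	3 Stalkers and 5 Zealots	3 Stalkers and 5 Zealots
8m	8 Marines	8 Marines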


Map	AMSAC	MSAC	Native AC	COMA	CommNet	GA-Comm
2s3z	47.02 (41.60~54.74)	29.82 (20.35~32.37)	30.61 (21.18~39.83)	15.19 (12.95~17.68)	4.95 (3.78~6.44)	7.34 (4.34~15.25)
1c3s5z	41.96 (32.76~46.88)	28.25 (21.67~30.71)	26.72 (21.88~31.00)	15.29 (8.38~22.01)	0.23 (0.00~0.98)	0.22 (0.00~0.72)
3s5z	4.21 (3.56~5.15)	1.17 (0.34~2.39)	0.76 (0.09~2.05)	0.08 (0.00~0.11)	0.01 (0.00~0.03)	0.01 (0.00~0.02)
8m	85.06 (78.68~86.75)	90.45 (89.74~91.22)	89.51 (88.59~90.70)	84.51 (83.54~85.07)	24.57 (8.71~54.24)	45.91 (28.45~54.32)


Map	AMSAC	MSAC	Native AC	COMA	CommNet	GA-Comm
2s3z	92.19	90.63	87.50	56.25	34.38	50.00
1c3s5z	100.00	100.00	100.00	78.13	6.25	21.88
3s5z	46.88	31.25	21.88	6.25	3.13	3.13
8m	100.00	100.00	100.00	100.00	100.00	100.00
