Improved QMIX algorithm from communication and exploration for multi-agent reinforcement learning

Journal of Computer Applications, 2023, Vol. 43, Issue (1): 202-208. DOI: 10.11772/j.issn.1001-9081.2021111886

Special topic: Advanced computing

DENG Huiyi^1,2, LI Yongzhen^1, YIN Qiyue^3

1. School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 102616, China
2. Department of Automation, Xiamen University, Xiamen, Fujian 361002, China
3. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Received: 2021-11-08 Revised: 2022-05-26 Online: 2023-01-12
Corresponding author: LI Yongzhen, born in 1983 in Beijing, male, Ph.D., senior experimentalist. His research interests include software theory and artificial intelligence. E-mail: liyongzhen@bucea.edu.cn
About the authors: DENG Huiyi, born in 1999 in Wuyishan, Fujian, male, M.S. candidate. His research interests include reinforcement learning and deep learning. YIN Qiyue, born in 1990 in Nanyang, Henan, male, Ph.D., research associate, CCF member. His research interests include machine learning and game AI.
Supported by:
the "Shipei Plan" cross-training program for high-level talents of Beijing higher education institutions; the 2022 Young Teachers' Research Ability Enhancement Program of Beijing University of Civil Engineering and Architecture (X22022).

Abstract

Non-stationarity is one of the main challenges for deep learning in multi-agent environments. It breaks the Markov assumption that most single-agent reinforcement learning algorithms rely on, so that during learning each agent may become trapped in an endless loop induced by the environment that the other agents create. To address this problem, the implementation of the Centralized Training with Decentralized Execution (CTDE) architecture in reinforcement learning was studied, and the QMIX algorithm was improved from two perspectives, inter-agent communication and agent exploration, by adopting the Variance Based Control (VBC) communication method and introducing a curiosity mechanism. The proposed algorithm was validated in micromanagement scenarios of the StarCraft Ⅱ Learning Environment (SC2LE). Experimental results show that, compared with the QMIX algorithm, the proposed algorithm achieves better performance and yields a training model that converges faster.

Key words: multi-agent environment, deep reinforcement learning, Centralized Training with Decentralized Execution (CTDE) architecture, curiosity mechanism, agent communication
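
The abstract names three ingredients: QMIX's value mixing, VBC-style communication, and a curiosity-driven exploration bonus. The sketch below illustrates the general idea of each in plain Python. It is not the authors' implementation: all function names, the single linear mixing layer, the gap-based request rule, and the squared-error bonus are assumptions chosen for illustration, since the abstract does not give the exact formulations.

```python
from typing import Sequence


def monotonic_mix(agent_qs: Sequence[float],
                  raw_weights: Sequence[float],
                  bias: float) -> float:
    """QMIX-style mixing, reduced to one linear layer for brevity.

    Taking the absolute value of every weight keeps dQ_tot/dQ_i >= 0,
    which is the monotonicity constraint of QMIX. In QMIX the weights are
    produced by hypernetworks conditioned on the global state; here they
    are plain inputs (assumption).
    """
    return sum(abs(w) * q for w, q in zip(raw_weights, agent_qs)) + bias


def wants_communication(local_qs: Sequence[float],
                        gap_threshold: float) -> bool:
    """VBC-style request rule (assumed form).

    An agent asks teammates for messages only when the gap between its best
    and second-best local action values is small, i.e. when it is unsure
    which action to take. Requires at least two actions.
    """
    top, second = sorted(local_qs, reverse=True)[:2]
    return (top - second) < gap_threshold


def curiosity_bonus(predicted_next: Sequence[float],
                    actual_next: Sequence[float],
                    scale: float) -> float:
    """Intrinsic reward as the scaled squared error of a forward model.

    This is one common curiosity formulation (assumption); poorly predicted
    transitions earn a larger bonus and are therefore explored more.
    """
    err = sum((p - a) ** 2 for p, a in zip(predicted_next, actual_next))
    return scale * err


if __name__ == "__main__":
    q_tot = monotonic_mix([1.0, 2.0], [-0.5, 0.25], bias=0.1)
    ask = wants_communication([1.0, 0.9, 0.2], gap_threshold=0.2)
    extrinsic_reward = 1.0
    shaped = extrinsic_reward + curiosity_bonus([0.0, 0.0], [1.0, 2.0], 0.5)
    print(q_tot, ask, shaped)
```

In this simplified picture, the curiosity bonus is added to the team reward during centralized training, while the communication rule only affects which messages are exchanged during decentralized execution.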

CLC number: TP18

DENG Huiyi, LI Yongzhen, YIN Qiyue. Improved QMIX algorithm from communication and exploration for multi-agent reinforcement learning[J]. Journal of Computer Applications, 2023, 43(1): 202-208.


