Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (1): 1-7.DOI: 10.11772/j.issn.1001-9081.2020061009

Special Issue: The 8th China Conference on Data Mining (CCDM 2020)

• China Conference on Data Mining 2020 (CCDM 2020) •

Reward highway network based global credit assignment algorithm in multi-agent reinforcement learning

YAO Xinghu1,2,3, TAN Xiaoyang1,2,3   

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing Jiangsu 211106, China;
    2. MIIT Key Laboratory of Pattern Analysis and Machine Intelligence(Nanjing University of Aeronautics and Astronautics), Nanjing Jiangsu 211106, China;
    3. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing University of Aeronautics and Astronautics, Nanjing Jiangsu 211106, China
  • Received: 2020-05-31; Revised: 2020-09-24; Online: 2021-01-10; Published: 2020-11-12
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61976115, 61672280, 61732006), the Equipment Pre-research Fund (6140312020413), the Artificial Intelligence+ Project of Nanjing University of Aeronautics and Astronautics (56XZA18009), the Project of Pre-research on Military Shared Information System Equipment (315025305), and the Graduate Innovation Foundation of Nanjing University of Aeronautics and Astronautics (Kfjj20191608).

  • Corresponding author: TAN Xiaoyang
  • About the authors: YAO Xinghu, born in Jining, Shandong in 1996, M. S. candidate, CCF member. His research interests include deep reinforcement learning and multi-agent reinforcement learning. TAN Xiaoyang, born in Nanjing, Jiangsu in 1971, Ph. D., professor, CCF member. His research interests include machine learning and deep reinforcement learning.

Abstract: To address the exponential growth of the joint action space as the number of agents increases in multi-agent systems, the "centralized training with decentralized execution" framework was adopted to avoid the curse of dimensionality of the joint action space and to reduce the optimization cost of the algorithm. To address the problem that, in many multi-agent reinforcement learning scenarios, the environment only provides a global reward for the joint behavior of all agents, a new global credit assignment mechanism, the Reward HighWay Network (RHWNet), was proposed. By introducing a reward highway connection into the global reward assignment mechanism of the original algorithm, the value function of each agent was connected directly to the global reward, so that each agent was able to consider both the global reward signal and the reward it actually receives when selecting its policy. Firstly, during training, the agents were coordinated through a centralized value function structure, which also served to assign the global reward. Then, the reward highway connection was introduced into this centralized value function structure to assist global reward assignment, thereby forming the reward highway network. Finally, during execution, the policy of each agent depended only on its own value function. Experimental results on StarCraft Multi-Agent Challenge (SMAC) micromanagement scenarios show that the proposed reward highway network improves the test win rate by more than 20% on four complex maps compared with the state-of-the-art Counterfactual Multi-Agent policy gradient (COMA) and QMIX algorithms. More importantly, on the 3s5z and 3s6z scenarios, which contain many agents of different types, the proposed network achieves better results while requiring only 30% of the samples needed by algorithms such as COMA and QMIX.
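
To make the mechanism described above concrete, the following is a minimal sketch (not the authors' implementation) of how a reward highway connection could be attached to a QMIX-style centralized mixing network in PyTorch. The layer sizes, the sigmoid gating form, and all names such as RewardHighwayMixer are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, assuming a QMIX-style central value function with an added
# "reward highway" skip term that ties each agent's utility Q_i directly to
# the joint value trained on the global reward. All hyperparameters and the
# gating form are assumptions for illustration.
import torch
import torch.nn as nn

class RewardHighwayMixer(nn.Module):
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents = n_agents
        # Hypernetworks that generate the mixing weights from the global state
        # (made non-negative in forward, as in QMIX, to keep monotonicity).
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))
        # Highway gate: how much of each agent's utility is passed straight
        # through to the joint value, bypassing the mixing layers.
        self.highway_gate = nn.Sequential(nn.Linear(state_dim, n_agents),
                                          nn.Sigmoid())

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents) individual values; state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, -1)
        b1 = self.hyper_b1(state).view(b, 1, -1)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, -1, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        mixed = torch.bmm(hidden, w2) + b2              # (batch, 1, 1)
        # Reward highway: a gated direct path from each agent's value to the
        # joint value, so the global reward signal reaches every Q_i directly.
        gate = self.highway_gate(state)                 # (batch, n_agents)
        highway = (gate * agent_qs).sum(dim=1, keepdim=True)
        return mixed.view(b, 1) + highway               # Q_tot, trained on the global reward
```

In this sketch the highway term is a non-negative gated sum of the individual utilities, so the joint value remains monotone in each Q_i and the TD error computed from the global reward back-propagates directly into every agent's value function, which is the intuition behind connecting each agent to the global reward signal.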

Key words: deep learning, deep reinforcement learning, multi-agent reinforcement learning, multi-agent system, global credit assignment

