Multi-robot reinforcement learning path planning method based on request-response communication mechanism and local attention mechanism

Journal of Computer Applications, 2024, Vol. 44, Issue (2): 432-438. DOI: 10.11772/j.issn.1001-9081.2023020193

Special topic: Artificial intelligence

Fuqin DENG¹,²,³, Huifeng GUAN¹, Chaoen TAN¹, Lanhui FU¹, Hongmin WANG¹, Tinlun LAM², Jianmin ZHANG¹

1. School of Intelligent Manufacturing, Wuyi University, Jiangmen Guangdong 529000, China
2. Shenzhen Institute of Artificial Intelligence and Robotics for Society, The Chinese University of Hong Kong (Shenzhen), Shenzhen Guangdong 518000, China
3. Shenzhen 3irobotix Company Limited, Shenzhen Guangdong 518000, China

Received: 2023-02-28 Revised: 2023-05-26 Accepted: 2023-05-29 Online: 2024-02-22 Published: 2024-02-10
Contact: Jianmin ZHANG
About author: DENG Fuqin, born in 1982, native of Chenzhou, Hunan, Ph. D., senior engineer. His research interests include machine learning, mobile robotic systems, multi-robot systems.
GUAN Huifeng, born in 1998, native of Shaoguan, Guangdong, M. S. candidate. His research interests include multi-agent path planning.
TAN Chaoen, born in 1999, native of Shunde, Guangdong, M. S. candidate. His research interests include multi-robot path planning.
FU Lanhui, born in 1987, native of Xinxiang, Henan, Ph. D., lecturer. Her research interests include machine learning, image information processing.
WANG Hongmin, born in 1981, native of Chengde, Hebei, Ph. D., associate professor. His research interests include robotics, bionic robots, robot motion control teleoperation.
LAM Tinlun, born in 1984, native of Hong Kong, Ph. D., assistant professor, CCF member. His research interests include modularized self-reconfigurable robots, multi-robot systems.
Supported by:
National Key Research and Development Program (2020YFB1313300); Shenzhen Science and Technology Plan Project (KQTD2016113010470345); Shenzhen Institute of Artificial Intelligence and Robotics Exploratory Research Project (AC01202101103); Wuyi University Horizontal Project (33520098)

Abstract

To reduce the blocking rate of multi-robot path planning in dynamic environments, a distributed deep reinforcement learning path planning method, Distributed Communication and local Attention based Multi-Agent Path Finding (DCAMAPF), was proposed. It is built on the Actor-Critic deep reinforcement learning framework and combines a request-response communication mechanism with a local attention mechanism. In the Actor network, each robot used the request-response communication mechanism to request local observation and action information from the other robots in its field of view, and then planned a coordinated action strategy. In the Critic network, each robot used the local attention mechanism to dynamically allocate attention weights to the local observation and action information of the other robots in its field of view that had responded successfully. The experimental results showed that, compared with the traditional dynamic path planning method D* Lite, the latest distributed reinforcement learning method MAPPER, and the latest centralized reinforcement learning method AB-MAPPER (Attention and BicNet based MAPPER), DCAMAPF reduced the mean blocking rate by approximately 6.91, 4.97 and 3.56 percentage points, respectively, in discrete initialization environments. In centralized initialization environments, blocking was avoided more efficiently: the mean blocking rate was reduced by approximately 15.86, 11.71 and 5.54 percentage points, respectively, and the occupied computing cache was also reduced. The proposed method therefore ensures the efficiency of path planning and is applicable to multi-robot path planning tasks in different dynamic environments.

Key words: multi-agent path finding, deep reinforcement learning, attention mechanism, communication, dynamic environment
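
The request-response step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Robot` class, the `can_respond` flag, the square field of view of half-width `fov`, and the use of a grid position as a stand-in for the "local observation" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Robot:
    """Hypothetical robot record; field names are illustrative only."""
    robot_id: int
    position: Tuple[int, int]   # stand-in for the robot's local observation
    last_action: str = "stay"
    can_respond: bool = True    # assumed flag: False models a failed response


def in_field_of_view(a: Robot, b: Robot, fov: int) -> bool:
    """True if b lies inside the square window of half-width fov around a."""
    dx = abs(a.position[0] - b.position[0])
    dy = abs(a.position[1] - b.position[1])
    return max(dx, dy) <= fov


def request_responses(requester: Robot, robots: List[Robot], fov: int) -> Dict[int, dict]:
    """Collect observation and action messages from robots that respond."""
    messages = {}
    for other in robots:
        if other.robot_id == requester.robot_id:
            continue  # a robot does not query itself
        if not in_field_of_view(requester, other, fov):
            continue  # outside the field of view: no request is sent
        if not other.can_respond:
            continue  # request sent, but no successful response
        messages[other.robot_id] = {
            "observation": other.position,
            "action": other.last_action,
        }
    return messages


robots = [
    Robot(1, (0, 0), "right"),
    Robot(2, (2, 1), "up", can_respond=False),
    Robot(3, (1, 1), "stay"),
    Robot(4, (9, 9), "left"),
]
# Robot No. 3 queries its neighbours, as in the Fig. 1 example.
replies = request_responses(robots[2], robots, fov=2)
print(sorted(replies))  # [1]
```

Here robot 2 is in view but fails to respond and robot 4 is out of view, so only robot 1's message would be passed on to the policy.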

CLC number: TP242

Fuqin DENG, Huifeng GUAN, Chaoen TAN, Lanhui FU, Hongmin WANG, Tinlun LAM, Jianmin ZHANG. Multi-robot reinforcement learning path planning method based on request-response communication mechanism and local attention mechanism[J]. Journal of Computer Applications, 2024, 44(2): 432-438.

Figures/Tables: 8

Fig. 1 Request-response mechanism (taking robot No. 3 as example)

Fig. 2 Network architecture of DCAMAPF

Fig. 3 Attention mechanism

Fig. 4 Experimental environments
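
The local attention step that the abstract attributes to the Critic network (Fig. 3) can be sketched as follows. The page does not give the attention formula, so this sketch assumes ordinary scaled dot-product attention over the robots that responded; the function names and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_attention(query, keys, values):
    """Weight responders' value vectors by scaled dot-product attention.

    query  : embedding of the attending robot
    keys   : one key vector per responding robot
    values : one value vector per responding robot
    Returns (weights, context). With no responders, returns ([], None).
    """
    if not keys:
        return [], None
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Context vector: attention-weighted sum of the responders' values.
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]        # first responder aligns with the query
values = [[10.0, 0.0], [0.0, 10.0]]
weights, context = local_attention(query, keys, values)
print(weights, context)
```

Because the weights are computed only over responders inside the field of view, the input size tracks the number of nearby robots rather than the whole team, which is one plausible reading of "local" here.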

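Tab. 1 Reward mechanism

Reward/penalty term	Value
Step reward/penalty $r_s$	-0.1 (moving) / -0.5 (stopped)
Collision penalty $r_c$	-0.5
Oscillation penalty $r_o$	-0.3
Off-route penalty $r_f$	$-\min_{p \in S} \lVert p_a - p \rVert_2$
Goal-reaching reward $r_g$	30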
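
A minimal sketch of how the Tab. 1 terms might be combined into a per-step reward is given below. The constants come from the table; everything else is an assumption: the terms are simply summed, $S$ is read as the set of points on a reference path, $p_a$ as the robot's current position, and the function and argument names are hypothetical.

```python
import math

# Values taken from Tab. 1.
STEP_MOVE = -0.1
STEP_STOP = -0.5
COLLISION = -0.5
OSCILLATION = -0.3
GOAL = 30.0


def off_route_penalty(position, reference_path):
    """Negative Euclidean distance to the nearest reference-path point (assumed meaning of S)."""
    return -min(math.dist(position, p) for p in reference_path)


def step_reward(position, reference_path, moved, collided, oscillated, reached_goal):
    """Sum the applicable reward terms for one step (summation is an assumption)."""
    reward = STEP_MOVE if moved else STEP_STOP
    if collided:
        reward += COLLISION
    if oscillated:
        reward += OSCILLATION
    reward += off_route_penalty(position, reference_path)
    if reached_goal:
        reward += GOAL
    return reward


path = [(0, 0), (0, 1), (0, 2)]
# One cell off the path while moving: -0.1 step penalty plus -1.0 off-route penalty.
r_off = step_reward((1, 1), path, moved=True, collided=False,
                    oscillated=False, reached_goal=False)
# On the path and arriving at the goal: -0.1 step penalty plus 30 goal reward.
r_goal = step_reward((0, 2), path, moved=True, collided=False,
                     oscillated=False, reached_goal=True)
print(r_off, r_goal)
```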

Tab. 2 Comparison of mean blocking rate and mean success rate among four methods in different environments in Fig. 4 (%)

Environment	D* Lite mean blocking rate	D* Lite mean success rate	MAPPER mean blocking rate	MAPPER mean success rate	AB-MAPPER mean blocking rate	AB-MAPPER mean success rate	DCAMAPF mean blocking rate	DCAMAPF mean success rate
Fig. 4(a)	88.89	94.51	84.73	95.32	76.91	95.82	70.07	97.45
Fig. 4(b)	38.60	96.21	35.28	96.48	33.47	97.64	27.81	98.19
Fig. 4(c)	91.47	93.24	87.33	94.60	82.81	96.28	78.57	97.64
Fig. 4(d)	22.94	96.74	22.39	97.25	21.38	97.38	19.94	98.81
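
The percentage-point reductions quoted in the abstract can be checked against the blocking rates in Tab. 2 with a few lines of Python. The grouping of panels is an assumption inferred from the numbers, not stated on this page: Fig. 4(a) and 4(c) are treated as centralized initialization environments and Fig. 4(b) and 4(d) as discrete ones.

```python
# Mean blocking rates (%) copied from Tab. 2, keyed by Fig. 4 panel.
blocking = {
    "D* Lite":   {"a": 88.89, "b": 38.60, "c": 91.47, "d": 22.94},
    "MAPPER":    {"a": 84.73, "b": 35.28, "c": 87.33, "d": 22.39},
    "AB-MAPPER": {"a": 76.91, "b": 33.47, "c": 82.81, "d": 21.38},
    "DCAMAPF":   {"a": 70.07, "b": 27.81, "c": 78.57, "d": 19.94},
}

# Assumption: which panels belong to which initialization type.
GROUPS = {"centralized": ("a", "c"), "discrete": ("b", "d")}


def mean_rate(method, panels):
    """Average blocking rate of one method over the given panels."""
    return sum(blocking[method][p] for p in panels) / len(panels)


def reductions(group):
    """Percentage-point reduction of DCAMAPF relative to each baseline."""
    panels = GROUPS[group]
    ours = mean_rate("DCAMAPF", panels)
    return {m: round(mean_rate(m, panels) - ours, 2)
            for m in blocking if m != "DCAMAPF"}


centralized = reductions("centralized")
discrete = reductions("discrete")
print(centralized)
print(discrete)
```

Under this grouping the centralized figures come out at 15.86, 11.71 and 5.54 percentage points, matching the abstract. The discrete figures land within about 0.02 of the reported 6.91, 4.97 and 3.56, a gap that may come from rounding in the tabulated values.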

Fig. 5 Training curves of ablation experiment methods

Tab. 3 Graphics card cache required by each robot for three deep reinforcement learning methods (MB)

Method	Value
MAPPER	55.48
AB-MAPPER	1192.00
DCAMAPF	60.82
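
As a quick worked comparison that uses only the Tab. 3 values, the per-robot cache of the centralized AB-MAPPER is roughly twenty times that of DCAMAPF, while DCAMAPF needs slightly more than the distributed MAPPER:

```latex
\frac{1192.00}{60.82} \approx 19.6, \qquad 60.82 - 55.48 = 5.34\ \text{MB}
```

The cache saving claimed in the abstract therefore appears to hold relative to AB-MAPPER rather than MAPPER, at least on these figures.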
