[1] 杨姝慧. 基于时空信息融合的深度强化学习机器人人群导航研究[D]. 济南: 齐鲁工业大学, 2024: 2.
YANG S H. Research on robot crowd navigation with deep reinforcement learning based on spatio-temporal information fusion[D]. Jinan: Qilu University of Technology, 2024: 2.

[2] 何丽, 张恒, 袁亮, 等. 服务机器人社会意识导航方法综述[J]. 计算机工程与应用, 2022, 58(11): 1-11.
HE L, ZHANG H, YUAN L, et al. Review of socially-aware navigation methods of service robots[J]. Computer Engineering and Applications, 2022, 58(11): 1-11.

[3] GARRELL A, SANFELIU A. Cooperative social robots to accompany groups of people[J]. The International Journal of Robotics Research, 2012, 31(13): 1675-1701.

[4] FERRER G, GARRELL A, SANFELIU A. Robot companion: a social-force based approach with human awareness-navigation in crowded environments[C]// Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway: IEEE, 2013: 1688-1694.
[5] HELBING D, MOLNÁR P. Social force model for pedestrian dynamics[J]. Physical Review E: Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics, 1995, 51(5): 4282-4286.
[6] VAN DEN BERG J, GUY S J, LIN M, et al. Reciprocal n-body collision avoidance[C]// Robotics Research: The 14th International Symposium ISRR, STAR 70. Berlin: Springer, 2011: 3-19.

[7] KRETZSCHMAR H, SPIES M, SPRUNK C, et al. Socially compliant mobile robot navigation via inverse reinforcement learning[J]. The International Journal of Robotics Research, 2016, 35(11): 1289-1307.

[8] TRAUTMAN P, MA J, MURRAY R M, et al. Robot navigation in dense human crowds: the case for cooperation[C]// Proceedings of the 2013 IEEE International Conference on Robotics and Automation. Piscataway: IEEE, 2013: 2153-2160.

[9] TRAUTMAN P, KRAUSE A. Unfreezing the robot: navigation in dense, interacting crowds[C]// Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway: IEEE, 2010: 797-803.

[10] 王少桐, 况立群, 韩慧妍, 等. 基于优势后见经验回放的强化学习导航方法[J]. 计算机工程, 2024, 50(1): 313-319.
WANG S T, KUANG L Q, HAN H Y, et al. Reinforcement learning navigation method based on advantage hindsight experience replay[J]. Computer Engineering, 2024, 50(1): 313-319.

[11] 李永迪, 李彩虹, 张耀玉, 等. 基于改进SAC算法的移动机器人路径规划[J]. 计算机应用, 2023, 43(2): 654-660.
LI Y D, LI C H, ZHANG Y Y, et al. Mobile robot path planning based on improved SAC algorithm[J]. Journal of Computer Applications, 2023, 43(2): 654-660.

[12] SHI H, SHI L, XU M, et al. End-to-end navigation strategy with deep reinforcement learning for mobile robots[J]. IEEE Transactions on Industrial Informatics, 2020, 16(4): 2393-2402.

[13] LONG P, FAN T, LIAO X, et al. Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning[C]// Proceedings of the 2018 IEEE International Conference on Robotics and Automation. Piscataway: IEEE, 2018: 6252-6259.
[14] XUE J, ZHANG S, LU Y, et al. Bidirectional obstacle avoidance enhancement-deep deterministic policy gradient: a novel algorithm for mobile-robot path planning in unknown dynamic environments[J]. Advanced Intelligent Systems, 2024, 6(4): No. 2300444.
[15] 马天, 席润韬, 吕佳豪, 等. 基于深度强化学习的移动机器人三维路径规划方法[J]. 计算机应用, 2024, 44(7): 2055-2064.
MA T, XI R T, LYU J H, et al. Mobile robot 3D space path planning method based on deep reinforcement learning[J]. Journal of Computer Applications, 2024, 44(7): 2055-2064.

[16] LU Y, CHEN Y, ZHAO D, et al. MGRL: graph neural network based inference in a Markov network with reinforcement learning for visual navigation[J]. Neurocomputing, 2021, 421: 140-150.

[17] 李忠伟, 刘伟鹏, 罗偲. 基于轨迹引导的移动机器人导航策略优化算法[J]. 计算机应用研究, 2024, 41(5): 1456-1461.
LI Z W, LIU W P, LUO C. Autonomous navigation policy optimization algorithm for mobile robots based on trajectory guidance[J]. Application Research of Computers, 2024, 41(5): 1456-1461.

[18] CHEN Y F, LIU M, EVERETT M, et al. Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning[C]// Proceedings of the 2017 IEEE International Conference on Robotics and Automation. Piscataway: IEEE, 2017: 285-292.

[19] CHEN Y F, EVERETT M, LIU M, et al. Socially aware motion planning with deep reinforcement learning[C]// Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway: IEEE, 2017: 1343-1350.

[20] EVERETT M, CHEN Y F, HOW J P. Motion planning among dynamic, decision-making agents with deep reinforcement learning[C]// Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway: IEEE, 2018: 3052-3059.

[21] LIU S, CHANG P, LIANG W, et al. Decentralized structural-RNN for robot crowd navigation with deep reinforcement learning[C]// Proceedings of the 2021 IEEE International Conference on Robotics and Automation. Piscataway: IEEE, 2021: 3517-3524.

[22] CHEN C, LIU Y, KREISS S, et al. Crowd-robot interaction: crowd-aware robot navigation with attention-based deep reinforcement learning[C]// Proceedings of the 2019 IEEE International Conference on Robotics and Automation. Piscataway: IEEE, 2019: 6015-6022.

[23] YANG Y, JIANG J, ZHANG J, et al. ST2: spatial-temporal state transformer for crowd-aware autonomous navigation[J]. IEEE Robotics and Automation Letters, 2023, 8(2): 912-919.

[24] 陈锶奇, 耿婕, 汪云飞, 等. 基于离线强化学习的研究综述[J]. 无线电通信技术, 2024, 50(5): 831-842.
CHEN S Q, GENG J, WANG Y F, et al. Survey of research on offline reinforcement learning[J]. Radio Communication Technology, 2024, 50(5): 831-842.

[25] FIGUEIREDO PRUDENCIO R, MAXIMO M R O A, COLOMBINI E L. A survey on offline reinforcement learning: taxonomy, review, and open problems[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(8): 10237-10257.

[26] FUJIMOTO S, MEGER D, PRECUP D. Off-policy deep reinforcement learning without exploration[C]// Proceedings of the 36th International Conference on Machine Learning. New York: JMLR.org, 2019: 2052-2062.
[27] 王洋, 张震, 王迪, 等. 基于可变保守程度离线强化学习的机器人运动控制方法[J/OL]. 控制工程 [2024-12-22].
WANG Y, ZHANG Z, WANG D, et al. Robot motion control method based on offline reinforcement learning with variable conservatism[J/OL]. Control Engineering [2024-12-22].
[28] KOSTRIKOV I, NAIR A, LEVINE S. Offline reinforcement learning with implicit Q-learning[EB/OL]. [2024-03-19].
[29] SHAH D, BHORKAR A, LEEN H, et al. Offline reinforcement learning for visual navigation[C]// Proceedings of the 6th Conference on Robot Learning. New York: JMLR.org, 2023: 44-54.

[30] NAIR A, GUPTA A, DALAL M, et al. AWAC: accelerating online reinforcement learning with offline datasets[EB/OL]. [2024-11-02].

[31] HAARNOJA T, ZHOU A, HARTIKAINEN K, et al. Soft actor-critic algorithms and applications[EB/OL]. [2024-09-03].
[32] FU J, KUMAR A, NACHUM O, et al. D4RL: datasets for deep data-driven reinforcement learning[EB/OL]. [2024-06-06].