《计算机应用》 (Journal of Computer Applications) ›› 2024, Vol. 44 ›› Issue (2): 439-444. DOI: 10.11772/j.issn.1001-9081.2023020132

• Artificial Intelligence •


Path planning algorithm of manipulator based on path imitation and SAC reinforcement learning

Ziyang SONG1, Junhuai LI1,2, Huaijun WANG1,2, Xin SU1, Lei YU1,2

  1. School of Computer Science and Engineering, Xi’an University of Technology, Xi’an Shaanxi 710048, China
    2. Shaanxi Key Laboratory for Network Computing and Security Technology, Xi’an Shaanxi 710048, China
  • Received:2023-02-16 Revised:2023-04-24 Accepted:2023-04-24 Online:2023-06-06 Published:2024-02-10
  • Contact: Huaijun WANG
  • About author: SONG Ziyang, born in 1998, M.S. candidate. His research interests include the internet of things and behavior recognition.
    LI Junhuai, born in 1969, Ph.D., professor. His research interests include the internet of things, behavior recognition, and network computing.
    SU Xin, born in 1994, M.S. candidate. His research interests include the internet of things.
    YU Lei, born in 1976, M.S., lecturer. Her research interests include the internet of things and computer networks.
  • Supported by:
    National Key Research and Development Program of China (2018YFB1703003); Shaanxi Provincial Key Research and Development Program (2022SF-353); Xi’an Science and Technology Plan Program (2022JH-RYFW-0072)


Abstract:

In the training of manipulator path planning algorithms, the huge action space and state space lead to sparse rewards and low training efficiency, and the immense number of states and actions makes it difficult to evaluate state values and action values. To address these problems, a manipulator path planning algorithm based on Soft Actor-Critic (SAC) reinforcement learning was proposed. Learning efficiency was improved by incorporating a demonstrated path into the reward function, so that the manipulator imitated the demonstrated path during reinforcement learning, and the SAC algorithm was adopted to make training faster and more stable. The proposed algorithm and the Deep Deterministic Policy Gradient (DDPG) algorithm were each used to plan 10 paths; the average distances between the planned paths and the reference path were 0.8 cm and 1.9 cm, respectively. Experimental results show that the path imitation mechanism improves training efficiency, and that the proposed algorithm explores the environment better than DDPG and produces more reasonable planned paths.
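The path-imitation idea described above, shaping the reward with the deviation from a demonstrated path in addition to the distance to the goal, can be sketched as follows. The function name, the weights, and the straight-line demonstration are illustrative assumptions, not the paper's actual reward design:

```python
import numpy as np

def imitation_reward(ee_pos, goal_pos, demo_path, w_goal=1.0, w_demo=0.5):
    """Shaped reward: penalize the distance from the end effector to the
    goal, plus its deviation from the nearest waypoint of a demonstrated
    (reference) path. Weights w_goal and w_demo are illustrative."""
    goal_dist = np.linalg.norm(ee_pos - goal_pos)
    # Deviation from the demonstration = distance to its closest waypoint.
    demo_dist = np.min(np.linalg.norm(demo_path - ee_pos, axis=1))
    return -(w_goal * goal_dist + w_demo * demo_dist)

# Example: a straight-line demonstration from the origin to (1, 0, 0).
demo = np.linspace([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], num=50)
goal = np.array([1.0, 0.0, 0.0])
r_on_path = imitation_reward(np.array([0.5, 0.0, 0.0]), goal, demo)
r_off_path = imitation_reward(np.array([0.5, 0.3, 0.0]), goal, demo)
assert r_on_path > r_off_path  # staying near the demonstration earns more reward
```

A dense shaping term of this kind counters the sparse-reward problem: every step receives a gradient-bearing signal, rather than only steps that reach the goal.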

Key words: imitation learning, Reinforcement Learning (RL), Soft Actor-Critic (SAC) algorithm, path planning, reward function
