《计算机应用》 (Journal of Computer Applications) ›› 2024, Vol. 44 ›› Issue (2): 439-444. DOI: 10.11772/j.issn.1001-9081.2023020132

• Artificial Intelligence •


Path planning algorithm of manipulator based on path imitation and SAC reinforcement learning

Ziyang SONG1, Junhuai LI1,2, Huaijun WANG1,2, Xin SU1, Lei YU1,2

  1. School of Computer Science and Engineering, Xi’an University of Technology, Xi’an Shaanxi 710048, China
    2. Shaanxi Key Laboratory for Network Computing and Security Technology, Xi’an Shaanxi 710048, China
  • Received:2023-02-16 Revised:2023-04-24 Accepted:2023-04-24 Online:2023-06-06 Published:2024-02-10
  • Contact: Huaijun WANG
  • About author: SONG Ziyang, born in 1998, M.S. candidate. His research interests include the internet of things and behavior recognition.
    LI Junhuai, born in 1969, Ph.D., professor. His research interests include the internet of things, behavior recognition, and network computing.
    SU Xin, born in 1994, M.S. candidate. His research interests include the internet of things.
    YU Lei, born in 1976, M.S., lecturer. Her research interests include the internet of things and computer networks.
  • Supported by:
    National Key Research and Development Program of China (2018YFB1703003); Shaanxi Provincial Key Research and Development Program (2022SF-353); Xi’an Science and Technology Plan Program (2022JH-RYFW-0072)


Abstract:

In the training of manipulator path planning algorithms, the huge action space and state space lead to sparse rewards and low training efficiency, and the immense number of states and actions makes it difficult to evaluate state values and action values. To address these problems, a manipulator path planning algorithm based on Soft Actor-Critic (SAC) reinforcement learning was proposed. Learning efficiency was improved by incorporating a demonstrated path into the reward function, so that the manipulator imitated the demonstrated path during reinforcement learning, and the SAC algorithm was adopted to make training faster and more stable. The proposed algorithm and the Deep Deterministic Policy Gradient (DDPG) algorithm were each used to plan 10 paths; the average distances between the planned paths and the reference path were 0.8 cm and 1.9 cm, respectively. Experimental results show that the path imitation mechanism improves training efficiency, and that the proposed algorithm explores the environment better than DDPG and produces more reasonable planned paths.
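The path-imitation idea described above, shaping the reward with the deviation from a demonstrated path in addition to the distance to the goal, can be sketched as follows. The function name, the weights, and the straight-line demonstration are illustrative assumptions, not the paper's actual reward design:

```python
import numpy as np

def imitation_reward(ee_pos, goal_pos, demo_path, w_goal=1.0, w_demo=0.5):
    """Shaped reward: penalize the distance from the end effector to the
    goal, plus its deviation from the nearest waypoint of a demonstrated
    (reference) path. Weights w_goal and w_demo are illustrative."""
    goal_dist = np.linalg.norm(ee_pos - goal_pos)
    # Deviation from the demonstration = distance to its closest waypoint.
    demo_dist = np.min(np.linalg.norm(demo_path - ee_pos, axis=1))
    return -(w_goal * goal_dist + w_demo * demo_dist)

# Example: a straight-line demonstration from the origin to (1, 0, 0).
demo = np.linspace([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], num=50)
goal = np.array([1.0, 0.0, 0.0])
r_on_path = imitation_reward(np.array([0.5, 0.0, 0.0]), goal, demo)
r_off_path = imitation_reward(np.array([0.5, 0.3, 0.0]), goal, demo)
assert r_on_path > r_off_path  # staying near the demonstration earns more reward
```

A dense shaping term of this kind counters the sparse-reward problem: every step receives a gradient-bearing signal, rather than only steps that reach the goal.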

Key words: imitation learning, Reinforcement Learning (RL), Soft Actor-Critic (SAC) algorithm, path planning, reward function
