Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (2): 445-451. DOI: 10.11772/j.issn.1001-9081.2023020153

• Artificial Intelligence •


Gait control method based on maximum entropy deep reinforcement learning for biped robot

Yuanchao LI1, Chongben TAO1,2(), Chen WANG1   

  1. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
    2. Suzhou Automotive Research Institute, Tsinghua University, Suzhou, Jiangsu 215134, China
  • Received:2023-02-21 Revised:2023-04-20 Accepted:2023-05-05 Online:2023-08-14 Published:2024-02-10
  • Contact: Chongben TAO
  • About author:LI Yuanchao, born in 1999 in Lianyungang, Jiangsu, M. S. candidate. His research interests include artificial intelligence, biped robot motion control.
    WANG Chen, born in 1990 in Taiyuan, Shanxi, Ph. D., lecturer. His research interests include biped robot motion control.
  • Supported by:
    National Natural Science Foundation of China(62201375);China Postdoctoral Science Foundation(2021M691848);Natural Science Foundation of Jiangsu Province(BK20220635);Science and Technology Project of Suzhou(SS2019029)


Abstract:

To address the gait stability control problem of continuous straight-line walking for a biped robot, a Soft Actor-Critic (SAC) gait control method based on maximum-entropy Deep Reinforcement Learning (DRL) was proposed. Firstly, the method required no accurate robot dynamics model built in advance, and all of its parameters were derived from joint angles without additional sensors. Secondly, the cosine similarity method was used to classify experience samples, optimizing the experience replay mechanism. Finally, reward functions were designed based on knowledge and experience so that the biped robot continuously adjusted its posture during straight-line walking training, which ensured the robustness of straight-line walking. The proposed method was compared with other advanced DRL methods, such as Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO), in the Roboschool simulation environment. Experimental results show that the proposed method not only achieves fast and stable straight-line walking of the biped robot, but also has better robustness.
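For background, the maximum-entropy objective that SAC optimizes (the standard formulation, not a result specific to this paper) augments the expected return with a policy-entropy bonus:

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
```

where the temperature $\alpha$ trades off reward maximization against exploration; a larger $\alpha$ yields a more stochastic policy during gait training.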

Key words: biped robot, gait control, deep reinforcement learning, maximum entropy, Soft Actor-Critic (SAC) algorithm
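As an illustrative sketch only (not the authors' implementation), the cosine-similarity classification of experience samples described in the abstract might look like the following; the buffer structure, similarity threshold, and reference-vector update rule are all assumptions made for this example:

```python
import numpy as np
from collections import deque

def cosine_similarity(a, b):
    """Cosine similarity between two flattened transition vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

class SimilarityReplayBuffer:
    """Replay buffer that classifies incoming transitions by cosine
    similarity to a running reference vector, keeping dissimilar
    ('novel') and similar ('ordinary') samples in separate pools."""

    def __init__(self, capacity=10000, threshold=0.9):
        self.novel = deque(maxlen=capacity)
        self.ordinary = deque(maxlen=capacity)
        self.threshold = threshold
        self.reference = None  # running reference vector (assumed design)

    def add(self, transition):
        # transition = (state, action, reward, next_state), all array-like
        vec = np.concatenate([np.ravel(x) for x in transition])
        if self.reference is None:
            self.reference = vec
            self.novel.append(transition)
            return
        sim = cosine_similarity(vec, self.reference)
        # similar samples go to the ordinary pool, dissimilar to the novel pool
        (self.ordinary if sim >= self.threshold else self.novel).append(transition)
        # slowly track the data distribution with an exponential moving average
        self.reference = 0.99 * self.reference + 0.01 * vec

    def sample(self, batch_size, novel_frac=0.5, rng=np.random):
        """Draw a mixed batch, biased toward dissimilar (novel) samples."""
        n_novel = min(len(self.novel), int(batch_size * novel_frac))
        n_ord = min(len(self.ordinary), batch_size - n_novel)
        idx_n = rng.choice(len(self.novel), n_novel, replace=False) if n_novel else []
        idx_o = rng.choice(len(self.ordinary), n_ord, replace=False) if n_ord else []
        return [self.novel[i] for i in idx_n] + [self.ordinary[i] for i in idx_o]
```

Sampling a fixed fraction of dissimilar transitions per batch is one plausible way such a classification could "optimize the experience replay mechanism"; the paper itself should be consulted for the actual scheme.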

CLC number: