Gait control method based on maximum entropy deep reinforcement learning for biped robot

Journal of Computer Applications, 2024, Vol. 44, Issue (2): 445-451. DOI: 10.11772/j.issn.1001-9081.2023020153

Special topic: Artificial intelligence

Yuanchao LI¹, Chongben TAO¹,², Chen WANG¹

1. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
2. Suzhou Automotive Research Institute, Tsinghua University, Suzhou, Jiangsu 215134, China

Received: 2023-02-21 Revised: 2023-04-20 Accepted: 2023-05-05 Online: 2023-08-14 Published: 2024-02-10
Corresponding author: Chongben TAO
About the authors: LI Yuanchao, born in 1999 in Lianyungang, Jiangsu, M.S. candidate. His research interests include artificial intelligence and biped robot motion control.
WANG Chen, born in 1990 in Taiyuan, Shanxi, Ph.D., lecturer. His research interests include biped robot motion control.
Supported by: National Natural Science Foundation of China (62201375); China Postdoctoral Science Foundation (2021M691848); Natural Science Foundation of Jiangsu Province (BK20220635); Science and Technology Project of Suzhou (SS2019029)

Abstract:

To address gait stability control for continuous straight-line walking of a biped robot, a Soft Actor-Critic (SAC) gait control method based on maximum entropy Deep Reinforcement Learning (DRL) was proposed. Firstly, the method requires no accurate robot dynamic model to be built in advance, and all parameters are derived from joint angles without additional sensors. Secondly, the cosine similarity method was used to classify experience samples and optimize the experience replay mechanism. Finally, reward functions were designed based on knowledge and experience so that the biped robot continuously adjusts its attitude during straight-line walking training, which ensures the robustness of straight-line walking. The proposed method was compared with other advanced DRL algorithms, such as Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO), in the Roboschool simulation environment. The results show that the proposed method not only achieves fast and stable straight-line walking of the biped robot, but also has better robustness.

Key words: biped robot, gait control, deep reinforcement learning, maximum entropy, Soft Actor-Critic (SAC) algorithm
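
The abstract says that experience samples are classified by cosine similarity to improve experience replay, but this page gives no further detail. The following is a minimal sketch of one way such a classification could look, not the authors' implementation. It assumes each stored state is compared against a reference state vector and routed into one of two pools by a threshold. The class name `SimilarityReplayBuffer`, the `reference` vector, the `threshold` value, and the `similar_fraction` sampling ratio are all hypothetical.

```python
import random

import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between two vectors; eps avoids division by zero."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


class SimilarityReplayBuffer:
    """Hypothetical two-pool replay buffer keyed on cosine similarity."""

    def __init__(self, reference, threshold=0.9, seed=0):
        # Assumption: a single reference state (e.g. a nominal posture).
        self.reference = np.asarray(reference, dtype=float)
        self.threshold = threshold
        self.similar = []      # transitions whose state resembles the reference
        self.dissimilar = []   # all other transitions
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Classify the transition by its state and store it in one pool."""
        sim = cosine_similarity(state, self.reference)
        pool = self.similar if sim >= self.threshold else self.dissimilar
        pool.append((state, action, reward, next_state, done))

    def sample(self, batch_size, similar_fraction=0.5):
        """Draw a mixed batch; it may be smaller than batch_size if pools are small."""
        n_sim = min(round(batch_size * similar_fraction), len(self.similar))
        n_dis = min(batch_size - n_sim, len(self.dissimilar))
        return (self.rng.sample(self.similar, n_sim)
                + self.rng.sample(self.dissimilar, n_dis))
```

Sampling from both pools at a fixed ratio is only one possible design choice; the point of the sketch is that the classification step needs nothing beyond the stored state vectors.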

CLC number: TP242.6

Yuanchao LI, Chongben TAO, Chen WANG. Gait control method based on maximum entropy deep reinforcement learning for biped robot[J]. Journal of Computer Applications, 2024, 44(2): 445-451.

Figures/Tables: 10

Fig. 1 Overall framework of the proposed gait control method

Fig. 2 Structure of policy network

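Tab. 1 State space

Parameter	Description
Dis	Walking distance
Hip_R/L pitch	Rotation angle of the hip joint about the y-axis
Hip_R/L roll	Rotation angle of the hip joint about the x-axis
Hip_R/L yaw	Rotation angle of the hip joint about the z-axis
Knee_R/L pitch	Rotation angle of the knee joint about the y-axis
Ankle_R/L pitch	Rotation angle of the ankle joint about the y-axis
Ankle_R/L roll	Rotation angle of the ankle joint about the x-axis
CoM_offx	Offset of the center of mass along the x-axis
CoM_offy	Offset of the center of mass along the y-axis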
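
As an illustration of how the state space in Tab. 1 could be flattened into a vector for a policy network, here is a minimal sketch. It assumes that each `R/L` row expands into one right-side and one left-side entry, which gives 15 components; the expanded names such as `Hip_R pitch`, the ordering, and the helper `build_state` are hypothetical. The page does not state the angle units, so none are assumed.

```python
import numpy as np

# Joint/axis pairs in the order they appear in Tab. 1.
STATE_JOINT_AXES = [
    ("Hip", "pitch"), ("Hip", "roll"), ("Hip", "yaw"),
    ("Knee", "pitch"), ("Ankle", "pitch"), ("Ankle", "roll"),
]


def state_names():
    """Expand the Tab. 1 rows into one name per scalar state component."""
    names = ["Dis"]
    for joint, axis in STATE_JOINT_AXES:
        for side in ("R", "L"):  # assumption: R/L rows cover both legs
            names.append(f"{joint}_{side} {axis}")
    names += ["CoM_offx", "CoM_offy"]
    return names


def build_state(readings):
    """Pack a dict of named readings into a fixed-order float32 vector."""
    names = state_names()
    missing = [n for n in names if n not in readings]
    if missing:
        raise KeyError(f"missing state components: {missing}")
    return np.array([readings[n] for n in names], dtype=np.float32)


# Usage: an all-zero reading yields a 15-dimensional state vector.
zero_state = build_state({name: 0.0 for name in state_names()})
```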

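Tab. 2 Action space

Parameter	Description
Hip_R/L pitch	Rotation angle of the hip joint about the y-axis
Hip_R/L roll	Rotation angle of the hip joint about the x-axis
Hip_R/L yaw	Rotation angle of the hip joint about the z-axis
Knee_R/L pitch	Rotation angle of the knee joint about the y-axis
Ankle_R/L roll	Rotation angle of the ankle joint about the x-axis
Ankle_R/L pitch	Rotation angle of the ankle joint about the y-axis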
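
The action space in Tab. 2 consists of joint angles. A minimal sketch of mapping a policy output onto such angles follows, assuming the policy emits values in [-1, 1] (as a tanh-squashed SAC actor typically does) and that each `R/L` row expands into two entries, for 12 components. The limits `LOW` and `HIGH` are placeholder values, not figures from the paper, and `scale_action` is a hypothetical helper.

```python
import numpy as np

# Joint/axis pairs in the order they appear in Tab. 2.
ACTION_JOINT_AXES = [
    ("Hip", "pitch"), ("Hip", "roll"), ("Hip", "yaw"),
    ("Knee", "pitch"), ("Ankle", "roll"), ("Ankle", "pitch"),
]
ACTION_NAMES = [
    f"{joint}_{side} {axis}"
    for joint, axis in ACTION_JOINT_AXES
    for side in ("R", "L")  # assumption: R/L rows cover both legs
]


def scale_action(raw, low, high):
    """Linearly map values in [-1, 1] onto [low, high], clipping out-of-range input."""
    raw = np.clip(np.asarray(raw, dtype=float), -1.0, 1.0)
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    return low + 0.5 * (raw + 1.0) * (high - low)


# Placeholder symmetric limits, purely illustrative.
LOW = np.full(len(ACTION_NAMES), -0.5)
HIGH = np.full(len(ACTION_NAMES), 0.5)

# Usage: a zero policy output maps to the midpoint of each joint range.
targets = dict(zip(ACTION_NAMES, scale_action(np.zeros(len(ACTION_NAMES)), LOW, HIGH)))
```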

Fig. 3 Comparison of reward values among four algorithms

Fig. 4 Details of the biped robot walking in Roboschool

Fig. 5 Changes in ankle joint angle

Fig. 6 Changes in angles of hip and knee joints

Fig. 7 Robust control when an external force is applied in the forward direction

Fig. 8 Robust control when an external force is applied to the right side of the biped robot

