Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (8): 2334-2341. DOI: 10.11772/j.issn.1001-9081.2023081079

• Artificial intelligence •

Proximal policy optimization algorithm based on clipping optimization and policy guidance

Yi ZHOU, Hua GAO, Yongshen TIAN

  1. School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan Hubei 430081, China
  • Received: 2023-08-10 Revised: 2023-10-15 Accepted: 2023-10-24 Online: 2023-12-18 Published: 2024-08-10
  • Contact: Yi ZHOU
  • About author: ZHOU Yi, born in 1983, Ph. D., associate professor, CCF member. His research interests include swarm intelligence optimization and deep reinforcement learning. E-mail: zhouyi83@wust.edu.cn
    GAO Hua, born in 1997, M. S. candidate. His research interests include deep reinforcement learning and autonomous driving.
    TIAN Yongshen, born in 1999, M. S. candidate. His research interests include deep reinforcement learning and intelligent gaming.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (62372343).


Abstract:

Addressing two issues in the Proximal Policy Optimization (PPO) algorithm, namely the difficulty of strictly constraining the difference between the old and new policies and the relatively low efficiency of exploration and exploitation, a PPO algorithm based on Clipping Optimization And Policy Guidance (COAPG-PPO) was proposed. Firstly, by analyzing the clipping mechanism of PPO, a trust-region clipping scheme based on the Wasserstein distance was devised to strengthen the constraint on the difference between the old and new policies. Secondly, ideas from simulated annealing and the greedy algorithm were incorporated into the policy update process to improve the exploration efficiency and learning speed of the algorithm. To validate the effectiveness of COAPG-PPO, comparative experiments were conducted on the MuJoCo benchmarks against PPO based on Clipping Optimization (CO-PPO), PPO with Covariance Matrix Adaptation (PPO-CMA), Trust Region-based PPO with RollBack (TR-PPO-RB), and the original PPO. The experimental results indicate that, in most environments, COAPG-PPO imposes stricter constraints on the policy update, achieves higher exploration and exploitation efficiency, and obtains higher reward values.
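As a rough illustration of the first idea, the sketch below combines the standard PPO-Clip surrogate with a Wasserstein-based trust-region gate for diagonal-Gaussian policies. It is a minimal sketch based only on this abstract: the gating rule, the threshold `w_max`, and all function names are illustrative assumptions, not the authors' COAPG-PPO implementation, and the simulated-annealing and greedy policy-guidance components are not shown.

```python
# Minimal sketch (not the authors' released code): PPO-Clip surrogate plus a
# Wasserstein trust-region gate for diagonal-Gaussian policies. The hard
# gating rule and the threshold `w_max` are illustrative assumptions.
import torch

def gaussian_w2_distance(mu_old, std_old, mu_new, std_new):
    """2-Wasserstein distance between diagonal Gaussians (closed form):
    W2^2 = ||mu_old - mu_new||^2 + ||std_old - std_new||^2."""
    return torch.sqrt(((mu_old - mu_new) ** 2).sum(-1)
                      + ((std_old - std_new) ** 2).sum(-1))

def wasserstein_gated_ppo_loss(log_prob_new, log_prob_old, advantage,
                               mu_old, std_old, mu_new, std_new,
                               eps=0.2, w_max=0.05):
    """PPO clipped surrogate with an extra Wasserstein trust-region gate.

    Where the per-state Wasserstein distance between the old and new
    policies exceeds `w_max`, the gradient through that sample is cut off,
    constraining the update more strictly than ratio clipping alone.
    """
    ratio = torch.exp(log_prob_new - log_prob_old)            # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    ppo_term = torch.min(unclipped, clipped)                  # standard PPO-Clip

    w_dist = gaussian_w2_distance(mu_old, std_old, mu_new, std_new)
    outside = (w_dist > w_max).float()                        # 1 if outside trust region

    # Inside the trust region: ordinary PPO-Clip term.
    # Outside: detach so the update cannot push the new policy further away.
    surrogate = (1.0 - outside) * ppo_term + outside * ppo_term.detach()
    return -surrogate.mean()                                  # loss to minimize
```

In a typical training loop, a loss of this kind would replace the usual PPO-Clip policy loss while the value loss and entropy bonus stay unchanged; `w_max` plays the role of the trust-region radius on the old-new policy difference.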

Key words: deep reinforcement learning, proximal policy optimization, trust region constraint, simulated annealing, greedy algorithm

