Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (8): 2334-2341. DOI: 10.11772/j.issn.1001-9081.2023081079

• Artificial intelligence •

Proximal policy optimization algorithm based on clipping optimization and policy guidance

Yi ZHOU, Hua GAO, Yongshen TIAN

  1. School of Information Science and Engineering, Wuhan University of Science and Technology, Wuhan Hubei 430081, China
  • Received: 2023-08-10 Revised: 2023-10-15 Accepted: 2023-10-24 Online: 2023-12-18 Published: 2024-08-10
  • Contact: Yi ZHOU
  • About author: ZHOU Yi, born in 1983, Ph. D., associate professor, CCF member. His research interests include swarm intelligence optimization and deep reinforcement learning. E-mail: zhouyi83@wust.edu.cn
    GAO Hua, born in 1997, M. S. candidate. His research interests include deep reinforcement learning and autonomous driving.
    TIAN Yongshen, born in 1999, M. S. candidate. His research interests include deep reinforcement learning and intelligent gaming.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (62372343).


Abstract:

Addressing two issues in the Proximal Policy Optimization (PPO) algorithm, namely the difficulty of strictly constraining the difference between the old and new policies and the relatively low efficiency of exploration and exploitation, a PPO algorithm based on Clipping Optimization And Policy Guidance (COAPG-PPO) was proposed. Firstly, by analyzing the clipping mechanism of PPO, a trust-region clipping scheme based on the Wasserstein distance was devised to strengthen the constraint on the difference between the old and new policies. Secondly, ideas from simulated annealing and the greedy algorithm were incorporated into the policy update process to improve the exploration efficiency and learning speed of the algorithm. To validate the effectiveness of COAPG-PPO, comparative experiments were conducted on the MuJoCo benchmarks against PPO based on Clipping Optimization (CO-PPO), PPO with Covariance Matrix Adaptation (PPO-CMA), Trust Region-based PPO with RollBack (TR-PPO-RB), and the original PPO. The experimental results indicate that, in most environments, COAPG-PPO imposes stricter constraints on the policy update, achieves higher exploration and exploitation efficiency, and obtains higher reward values.
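As a rough illustration of the first idea, the sketch below combines the standard PPO-Clip surrogate with a Wasserstein-based trust-region gate for diagonal-Gaussian policies. It is a minimal sketch based only on this abstract: the gating rule, the threshold `w_max`, and all function names are illustrative assumptions, not the authors' COAPG-PPO implementation, and the simulated-annealing and greedy policy-guidance components are not shown.

```python
# Minimal sketch (not the authors' released code): PPO-Clip surrogate plus a
# Wasserstein trust-region gate for diagonal-Gaussian policies. The hard
# gating rule and the threshold `w_max` are illustrative assumptions.
import torch

def gaussian_w2_distance(mu_old, std_old, mu_new, std_new):
    """2-Wasserstein distance between diagonal Gaussians (closed form):
    W2^2 = ||mu_old - mu_new||^2 + ||std_old - std_new||^2."""
    return torch.sqrt(((mu_old - mu_new) ** 2).sum(-1)
                      + ((std_old - std_new) ** 2).sum(-1))

def wasserstein_gated_ppo_loss(log_prob_new, log_prob_old, advantage,
                               mu_old, std_old, mu_new, std_new,
                               eps=0.2, w_max=0.05):
    """PPO clipped surrogate with an extra Wasserstein trust-region gate.

    Where the per-state Wasserstein distance between the old and new
    policies exceeds `w_max`, the gradient through that sample is cut off,
    constraining the update more strictly than ratio clipping alone.
    """
    ratio = torch.exp(log_prob_new - log_prob_old)            # pi_new / pi_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    ppo_term = torch.min(unclipped, clipped)                  # standard PPO-Clip

    w_dist = gaussian_w2_distance(mu_old, std_old, mu_new, std_new)
    outside = (w_dist > w_max).float()                        # 1 if outside trust region

    # Inside the trust region: ordinary PPO-Clip term.
    # Outside: detach so the update cannot push the new policy further away.
    surrogate = (1.0 - outside) * ppo_term + outside * ppo_term.detach()
    return -surrogate.mean()                                  # loss to minimize
```

In a typical training loop, a loss of this kind would replace the usual PPO-Clip policy loss while the value loss and entropy bonus stay unchanged; `w_max` plays the role of the trust-region radius on the old-new policy difference.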

Key words: deep reinforcement learning, proximal policy optimization, trust region constraint, simulated annealing, greedy algorithm

