Journal of Computer Applications

Proximal policy optimization algorithm based on clipping optimization and policy guidance

  

  • Received: 2023-08-10  Revised: 2023-10-15  Online: 2023-12-18  Published: 2023-12-18

ZHOU Yi 1, GAO Hua 2, TIAN Yongchen 1

  1. School of Information Science and Engineering, Wuhan University of Science and Technology
  2. Wuhan University of Science and Technology
  • Corresponding author: ZHOU Yi
  • Supported by:
    National Natural Science Foundation of China

Abstract: Two problems of the Proximal Policy Optimization (PPO) algorithm are addressed: 1) it is difficult to strictly constrain the difference between the old and new policies; 2) the exploration and exploitation efficiency is low. A Proximal Policy Optimization algorithm based on Clipping Optimization and Policy Guidance (COAPG-PPO) is proposed. First, by analyzing the clipping mechanism of PPO, a trust-region clipping scheme based on the Wasserstein distance is designed, which strengthens the constraint on the difference between the old and new policies. Second, the ideas of simulated annealing and the greedy algorithm are incorporated into the policy update process, which improves the exploration efficiency and learning speed of the algorithm. To verify the effectiveness of the algorithm, comparative experiments are conducted on COAPG-PPO, CO-PPO, PPO-CMA, TR-PPO-RB and PPO using the MuJoCo benchmarks. The experimental results show that COAPG-PPO achieves better performance in most environments.
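To make the two ideas in the abstract concrete, the sketch below (plain NumPy, illustrative only) shows the closed-form 2-Wasserstein distance between diagonal Gaussian policies and one simple way such a distance could gate the PPO clip, together with a simulated-annealing style schedule for the clip range. The gating rule, the threshold w_max, and the schedule annealed_eps are assumptions made for exposition; they are not the exact COAPG-PPO update, and the greedy policy-guidance term is omitted.

    import numpy as np

    def w2_diag_gaussian(mu_old, std_old, mu_new, std_new):
        # Closed-form 2-Wasserstein distance between diagonal Gaussians:
        # W2^2 = ||mu_new - mu_old||^2 + ||std_new - std_old||^2
        return np.sqrt(np.sum((mu_new - mu_old) ** 2) + np.sum((std_new - std_old) ** 2))

    def clipped_surrogate(ratio, advantage, w_dist, eps=0.2, w_max=0.05):
        # Hypothetical gating rule: leave the ratio unclipped while the new
        # policy stays inside the Wasserstein trust region, otherwise fall
        # back to the standard PPO clip.
        if w_dist <= w_max:
            return ratio * advantage
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
        return np.minimum(ratio * advantage, clipped * advantage)

    def annealed_eps(step, total_steps, eps_start=0.3, eps_end=0.1):
        # Simulated-annealing style schedule: a wide clip range early for
        # exploration, shrinking toward eps_end for exploitation.
        frac = min(step / total_steps, 1.0)
        return eps_start + (eps_end - eps_start) * frac

    # Toy usage with a 1-D Gaussian policy update.
    mu_old, std_old = np.array([0.0]), np.array([1.0])
    mu_new, std_new = np.array([0.1]), np.array([0.9])
    w = w2_diag_gaussian(mu_old, std_old, mu_new, std_new)
    eps = annealed_eps(step=5_000, total_steps=100_000)
    print(w, clipped_surrogate(ratio=1.15, advantage=0.8, w_dist=w, eps=eps))

In a full implementation the distance would be evaluated per state from the policy network's Gaussian output and averaged over the batch; the sketch only illustrates how the distance and the annealed clip range could interact.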

Key words: Deep Reinforcement Learning, Proximal Policy Optimization, Trust Region Constraint, Simulated Annealing, Greedy Algorithm

