Journal of Computer Applications

Proximal policy optimization algorithm based on clipping optimization and policy guidance

  

  • Received: 2023-08-10  Revised: 2023-10-15  Online: 2023-12-18  Published: 2023-12-18

ZHOU Yi 1, GAO Hua 2, TIAN Yongchen 1

  1. School of Information Science and Engineering, Wuhan University of Science and Technology
  2. Wuhan University of Science and Technology
  • Corresponding author: ZHOU Yi
  • Supported by:
    National Natural Science Foundation of China

Abstract: Two problems of the Proximal Policy Optimization (PPO) algorithm are addressed: 1) it is difficult to strictly constrain the difference between the old and new policies; 2) the exploration and exploitation efficiency is low. A Proximal Policy Optimization algorithm based on Clipping Optimization and Policy Guidance (COAPG-PPO) is proposed. First, by analyzing the clipping mechanism of PPO, a trust-region clipping scheme based on the Wasserstein distance is designed, which strengthens the constraint on the difference between the old and new policies. Second, the ideas of simulated annealing and the greedy algorithm are incorporated into the policy update process, which improves the exploration efficiency and learning speed of the algorithm. To verify the effectiveness of the algorithm, comparative experiments are conducted on COAPG-PPO, CO-PPO, PPO-CMA, TR-PPO-RB and PPO using the MuJoCo benchmarks. The experimental results show that COAPG-PPO achieves better performance in most environments.
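To make the two ideas in the abstract concrete, the sketch below (plain NumPy, illustrative only) shows the closed-form 2-Wasserstein distance between diagonal Gaussian policies and one simple way such a distance could gate the PPO clip, together with a simulated-annealing style schedule for the clip range. The gating rule, the threshold w_max, and the schedule annealed_eps are assumptions made for exposition; they are not the exact COAPG-PPO update, and the greedy policy-guidance term is omitted.

    import numpy as np

    def w2_diag_gaussian(mu_old, std_old, mu_new, std_new):
        # Closed-form 2-Wasserstein distance between diagonal Gaussians:
        # W2^2 = ||mu_new - mu_old||^2 + ||std_new - std_old||^2
        return np.sqrt(np.sum((mu_new - mu_old) ** 2) + np.sum((std_new - std_old) ** 2))

    def clipped_surrogate(ratio, advantage, w_dist, eps=0.2, w_max=0.05):
        # Hypothetical gating rule: leave the ratio unclipped while the new
        # policy stays inside the Wasserstein trust region, otherwise fall
        # back to the standard PPO clip.
        if w_dist <= w_max:
            return ratio * advantage
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
        return np.minimum(ratio * advantage, clipped * advantage)

    def annealed_eps(step, total_steps, eps_start=0.3, eps_end=0.1):
        # Simulated-annealing style schedule: a wide clip range early for
        # exploration, shrinking toward eps_end for exploitation.
        frac = min(step / total_steps, 1.0)
        return eps_start + (eps_end - eps_start) * frac

    # Toy usage with a 1-D Gaussian policy update.
    mu_old, std_old = np.array([0.0]), np.array([1.0])
    mu_new, std_new = np.array([0.1]), np.array([0.9])
    w = w2_diag_gaussian(mu_old, std_old, mu_new, std_new)
    eps = annealed_eps(step=5_000, total_steps=100_000)
    print(w, clipped_surrogate(ratio=1.15, advantage=0.8, w_dist=w, eps=eps))

In a full implementation the distance would be evaluated per state from the policy network's Gaussian output and averaged over the batch; the sketch only illustrates how the distance and the annealed clip range could interact.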

Key words: Deep Reinforcement Learning, Proximal Policy Optimization, Trust Region Constraint, Simulated Annealing, Greedy Algorithm

