Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (8): 2334-2341. DOI: 10.11772/j.issn.1001-9081.2023081079
• Artificial intelligence •

Proximal policy optimization algorithm based on clipping optimization and policy guidance

Yi ZHOU, Hua GAO, Yongshen TIAN
Received: 2023-08-10
Revised: 2023-10-15
Accepted: 2023-10-24
Online: 2023-12-18
Published: 2024-08-10
Contact: Yi ZHOU (zhouyi83@wust.edu.cn)
About author: ZHOU Yi, born in 1983 in Hanchuan, Hubei, Ph.D., associate professor, CCF member. His research interests include swarm intelligence optimization and deep reinforcement learning.
Yi ZHOU, Hua GAO, Yongshen TIAN. Proximal policy optimization algorithm based on clipping optimization and policy guidance[J]. Journal of Computer Applications, 2024, 44(8): 2334-2341.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2023081079
Tab. 1 State dimensions and action dimensions of experimental tasks

| Task | State dimension | Action dimension | Task | State dimension | Action dimension |
|---|---|---|---|---|---|
| Swimmer-v2 | 8 | 2 | Humanoid-v2 | 376 | 17 |
| Reacher-v2 | 11 | 2 | HumanoidStandup-v2 | 376 | 17 |
| Walker2d-v2 | 17 | 6 |  |  |  |
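For reference, the dimensions in Tab. 1 can be checked directly against the Gym MuJoCo tasks. The sketch below is illustrative only; it assumes an OpenAI Gym installation (pre-0.26 API) with the mujoco-py backend that provides the v2 task versions.

```python
# Minimal sketch (not from the paper): print the state/action dimensions of the
# five experimental tasks listed in Tab. 1. Assumes gym + mujoco-py are installed.
import gym

TASKS = ["Swimmer-v2", "Reacher-v2", "Walker2d-v2", "Humanoid-v2", "HumanoidStandup-v2"]

for task in TASKS:
    env = gym.make(task)
    state_dim = env.observation_space.shape[0]   # e.g. 8 for Swimmer-v2
    action_dim = env.action_space.shape[0]       # e.g. 2 for Swimmer-v2
    print(f"{task}: state dim = {state_dim}, action dim = {action_dim}")
    env.close()
```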
Tab. 2 Initial threshold and attenuation coefficient settings for CO-PPO algorithm

| Task | Initial threshold | Attenuation coefficient | Task | Initial threshold | Attenuation coefficient |
|---|---|---|---|---|---|
| Swimmer-v2 | 0.10 | 0.010 | Humanoid-v2 | 0.14 | 0.010 |
| Reacher-v2 | 0.04 | 0.016 | HumanoidStandup-v2 | 0.13 | 0.004 |
| Walker2d-v2 | 0.10 | 0.008 |  |  |  |
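Tab. 2 lists, per task, an initial clipping threshold and an attenuation coefficient for CO-PPO. This page does not describe how CO-PPO combines them, so the following is only a generic sketch: a standard PPO clipped surrogate loss whose threshold ε is decayed geometrically with the update index. The decay form and function names are assumptions for illustration, not the paper's clipping-optimization rule.

```python
# Minimal sketch, not CO-PPO itself: PPO-Clip loss with a decaying clip threshold.
import torch

def clip_threshold(initial_eps: float, decay: float, update_idx: int) -> float:
    """Assumed schedule: epsilon shrinks geometrically with each policy update."""
    return initial_eps * (1.0 - decay) ** update_idx

def clipped_surrogate_loss(log_prob_new, log_prob_old, advantages, eps):
    """Standard PPO-Clip objective, negated so it can be minimized by gradient descent."""
    ratio = torch.exp(log_prob_new - log_prob_old)             # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))

# Example with the Walker2d-v2 setting from Tab. 2 (initial threshold 0.10, coefficient 0.008).
eps = clip_threshold(0.10, 0.008, update_idx=50)
```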
Tab. 3 Average rewards of various algorithms in last 40% episodes

| Algorithm | Swimmer-v2 | Reacher-v2 | Walker2d-v2 | Humanoid-v2 | HumanoidStandup-v2 |
|---|---|---|---|---|---|
| PPO | 102.95 | -8.70 | 3 310.04 | 506.26 | 105 795.94 |
| TR-PPO-RB | 97.70 | -7.74 | 3 786.54 | 530.88 | 140 587.55 |
| PPO-CMA | 119.17 | -7.92 | 3 942.12 | 598.77 | 150 800.39 |
| CO-PPO | 86.51 | -7.05 | 4 294.67 | 539.33 | 143 369.69 |
| COAPG-PPO | 111.01 | -6.37 | 4 670.17 | 638.98 | 162 024.21 |
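Tab. 3's metric is the mean episode reward over the final 40% of training episodes. A minimal sketch of that computation, assuming a hypothetical per-episode return log:

```python
# Minimal sketch of the Tab. 3 metric: mean return over the last 40% of episodes.
import numpy as np

def last_40_percent_mean(episode_rewards):
    rewards = np.asarray(episode_rewards, dtype=float)
    start = int(np.ceil(0.6 * len(rewards)))   # drop the first 60% of episodes
    return rewards[start:].mean()
```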
Tab. 4 Timesteps (×10³) for various algorithms to reach reward thresholds

| Task | Reward threshold | PPO | TR-PPO-RB | PPO-CMA | CO-PPO | COAPG-PPO |
|---|---|---|---|---|---|---|
| Swimmer-v2 | 80 | 370 | 558 | 434 | 260 | 276 |
| Reacher-v2 | -8 | 668 | 636 | 641 | 483 | 397 |
| Walker2d-v2 | 3 000 | 409 | 423 | 470 | 299 | 255 |
| Humanoid-v2 | 500 | 179 | 500 | 385 | 292 | 247 |
| HumanoidStandup-v2 | 110 000 | 224 | 287 | 283 | 323 | 162 |
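Tab. 4 measures sample efficiency as the number of timesteps (in units of 10³) needed to first reach a task-specific reward threshold. A minimal sketch of that metric, assuming hypothetical (timestep, reward) evaluation logs and no particular smoothing; the paper's exact evaluation protocol is not given on this page:

```python
# Minimal sketch of the Tab. 4 metric: first timestep at which the reward curve
# reaches the task's threshold. Inputs are hypothetical evaluation logs.
import numpy as np

def timestep_to_threshold(timesteps, rewards, threshold):
    timesteps = np.asarray(timesteps)
    rewards = np.asarray(rewards, dtype=float)
    hits = np.nonzero(rewards >= threshold)[0]
    return int(timesteps[hits[0]]) if hits.size else None   # None: never reached

# e.g. timestep_to_threshold(ts, rw, threshold=3000) for Walker2d-v2;
# Tab. 4 reports the result divided by 10^3.
```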
| 1 | MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning [J]. Nature, 2015, 518(7540): 529-533. | 
| 2 | LUONG N C, HOANG D T, GONG S, et al. Applications of deep reinforcement learning in communications and networking: a survey [J]. IEEE Communications Surveys & Tutorials, 2019, 21(4): 3133-3174. | 
| 3 | LIU Q, ZHAI J W, ZHANG Z Z, et al. A survey on deep reinforcement learning [J]. Chinese Journal of Computers, 2018, 41(1): 1-27. | 
| 4 | AFSAR M M, CRUMP T, FAR B. Reinforcement learning based recommender systems: a survey [EB/OL]. (2022-06-08)[2023-08-01]. . | 
| 5 | WANG X S, WANG R R, CHENG Y H. Safety reinforcement learning: a survey [J]. Acta Automatica Sinica, 2023, 49(9): 1813-1835. | 
| 6 | KOHL N, STONE P. Policy gradient reinforcement learning for fast quadrupedal locomotion [C]// Proceedings of the 2004 IEEE International Conference on Robotics and Automation. Piscataway: IEEE, 2004, 3: 2619-2624. | 
| 7 | PETERS J, SCHAAL S. Reinforcement learning of motor skills with policy gradients [J]. Neural Networks, 2008, 21(4): 682-697. | 
| 8 | SCHULMAN J, LEVINE S, MORITZ P, et al. Trust region policy optimization [C]// Proceedings of the 32nd International Conference on Machine Learning. New York: JMLR.org, 2015: 1889-1897. | 
| 9 | SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms [EB/OL]. (2017-08-28) [2023-06-28]. . | 
| 10 | WANG Y, HE H, TAN X. Truly proximal policy optimization [C]// Proceedings of the 35th Uncertainty in Artificial Intelligence Conference. New York: JMLR.org, 2020: 113-122. | 
| 11 | CHENG Y, HUANG L, WANG X. Authentic boundary proximal policy optimization [J]. IEEE Transactions on Cybernetics, 2021, 52(9): 9428-9438. | 
| 12 | ZHANG L, SHEN L, YANG L, et al. Penalized proximal policy optimization for safe reinforcement learning [EB/OL]. (2022-06-17) [2023-06-29]. . | 
| 13 | GU Y, CHENG Y, CHEN C L P, et al. Proximal policy optimization with policy feedback [J]. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2021, 52(7): 4600-4610. | 
| 14 | KOBAYASHI T. Proximal policy optimization with relative pearson divergence [C]// Proceedings of the 2021 IEEE International Conference on Robotics and Automation. Piscataway: IEEE, 2021: 8416-8421. | 
| 15 | QUEENEY J, PASCHALIDIS Y, CASSANDRAS C G. Generalized proximal policy optimization with sample reuse [C]// Proceedings of the 35th Conference on Neural Information Processing Systems. La Jolla, CA: NIPS, 2021: 11909-11919. | 
| 16 | ZHANG J W, LYU S, ZHANG Z H, et al. Survey on deep reinforcement learning methods based on sample efficiency optimization [J]. Journal of Software, 2022, 33(11): 4217-4238. | 
| 17 | TODOROV E, EREZ T, TASSA Y. MuJoCo: a physics engine for model-based control [C]// Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway: IEEE, 2012: 5026-5033. | 
| 18 | HÄMÄLÄINEN P, BABADI A, MA X, et al. PPO-CMA: proximal policy optimization with covariance matrix adaptation [C]// Proceedings of the 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing. Piscataway: IEEE, 2020: 1-6. | 
| 19 | PANARETOS V M, ZEMEL Y. Statistical aspects of Wasserstein distances [J]. Annual Review of Statistics and Its Application, 2019, 6: 405-431. | 
| 20 | BERTSIMAS D, TSITSIKLIS J. Simulated annealing [J]. Statistical Science, 1993, 8(1): 10-15. | 
| 21 | VINCE A. A framework for the greedy algorithm [J]. Discrete Applied Mathematics, 2002, 121(1/2/3): 247-260. | 