Journal of Computer Applications


Policy optimization method integrating overestimated and underestimated value functions

ZHANG Ziheng1,2, QIN Jin1,2   

  1. State Key Laboratory of Public Big Data (Guizhou University); 2. College of Computer Science and Technology, Guizhou University
  • Received: 2025-10-09  Revised: 2025-12-12  Online: 2025-12-26  Published: 2025-12-26
  • About author: ZHANG Ziheng, born in 2002 in Hengshui, Hebei, M.S. candidate. His research interests include reinforcement learning. QIN Jin, born in 1978 in Qianxi, Guizhou, Ph.D., associate professor, CCF member. His research interests include computational intelligence and reinforcement learning.
  • Supported by:
    National Natural Science Foundation of China (62162007); Guizhou Provincial Science and Technology Program (Qiankehe Talents KJZY [2025]020)

  • Corresponding author: QIN Jin

Abstract: In reinforcement learning, the estimation bias of value functions is a critical challenge that restricts improvement of algorithm performance. It stems from factors such as the dynamic uncertainty of the environment, function approximation errors, and data distribution shifts, and it causes the direction of policy optimization to deviate from the true optimum and can even make the training process diverge. To alleviate this problem systematically, a policy optimization method that combines an overestimated value function with an underestimated value function was proposed. A dynamic weight adjustment method was also designed, in which the weight of the overestimated value function is adjusted according to a real-time approximation of the degree to which the current underestimated value function underestimates. In addition, a method for dynamically selecting the underestimated value function was presented. To verify the effectiveness of the proposed policy optimization method, it was combined with the TD3 (Twin Delayed Deep Deterministic policy gradient), TADD (Triplet Average Deep Deterministic policy gradient), and QMD3 (Quasi-Median Delayed Deep Deterministic policy gradient) algorithms, and evaluated on six control tasks in OpenAI Gym. The experimental results show that the proposed method achieves performance improvements on most tasks.

Key words: complementary value network, dynamic weight adjustment, policy optimization, estimation bias, underestimated value network selection
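The core mechanism the abstract describes — blending an overestimated and an underestimated value estimate, with the overestimate's weight driven by the current degree of underestimation — can be sketched as follows. This is a minimal illustration under our own assumptions (a TD3-style twin-critic setup where `max` of the two critics overestimates and `min` underestimates); the function names and the specific weighting rule are hypothetical, not the paper's exact formulation.

```python
def combined_target(q1: float, q2: float, w: float) -> float:
    """Blend twin-critic estimates for one state-action pair.

    In a TD3-style setup, max(q1, q2) tends to overestimate and
    min(q1, q2) tends to underestimate the true action value.
    A weight w in [0, 1] mixes in the overestimated side.
    """
    over, under = max(q1, q2), min(q1, q2)
    return w * over + (1.0 - w) * under


def dynamic_weight(under: float, mean_q: float, cap: float = 0.5) -> float:
    """Hypothetical dynamic weight rule (illustrative only).

    Uses the gap between the mean critic value and the underestimated
    one as a rough proxy for the underestimation degree: the larger
    the gap, the more weight the overestimate receives, capped at `cap`.
    """
    gap = max(mean_q - under, 0.0)
    return min(gap / (abs(mean_q) + 1e-8), cap)
```

With `w = 0` this recovers the standard clipped double-Q (pure min) target, and with `w = 1` the pure max, so the blend interpolates between the two bias regimes.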

