Journal of Computer Applications


Policy optimization method integrating overestimated and underestimated value functions

ZHANG Ziheng1,2, QIN Jin1,2   

  1. State Key Laboratory of Public Big Data (Guizhou University); 2. College of Computer Science and Technology, Guizhou University
  • Received: 2025-10-09  Revised: 2025-12-12  Online: 2025-12-26  Published: 2025-12-26
  • About author: ZHANG Ziheng, born in 2002 in Hengshui, Hebei, M.S. candidate. His research interests include reinforcement learning. QIN Jin, born in 1978 in Qianxi, Guizhou, Ph.D., associate professor, CCF member. His research interests include computational intelligence and reinforcement learning.
  • Supported by:
    National Natural Science Foundation of China (62162007); Guizhou Provincial Science and Technology Program (Qiankehe Talents KJZY [2025]020)

  • Corresponding author: QIN Jin

Abstract: In reinforcement learning, the estimation bias of value functions is a critical challenge that restricts improvement of algorithm performance. It stems from factors such as the dynamic uncertainty of the environment, function approximation errors, and data distribution shifts, and it causes the direction of policy optimization to deviate from the true optimum and can even make the training process diverge. To alleviate this problem systematically, a policy optimization method that combines an overestimated value function with an underestimated value function was proposed. A dynamic weight adjustment method was also designed, in which the weight of the overestimated value function is adjusted according to a real-time approximation of the degree to which the current underestimated value function underestimates. In addition, a method for dynamically selecting the underestimated value function was presented. To verify the effectiveness of the proposed policy optimization method, it was combined with the TD3 (Twin Delayed Deep Deterministic policy gradient), TADD (Triplet Average Deep Deterministic policy gradient), and QMD3 (Quasi-Median Delayed Deep Deterministic policy gradient) algorithms, and evaluated on six control tasks in OpenAI Gym. The experimental results show that the proposed method achieves performance improvements on most tasks.

Key words: complementary value network, dynamic weight adjustment, policy optimization, estimation bias, underestimated value network selection
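The core mechanism the abstract describes — blending an overestimated and an underestimated value estimate, with the overestimate's weight driven by the current degree of underestimation — can be sketched as follows. This is a minimal illustration under our own assumptions (a TD3-style twin-critic setup where `max` of the two critics overestimates and `min` underestimates); the function names and the specific weighting rule are hypothetical, not the paper's exact formulation.

```python
def combined_target(q1: float, q2: float, w: float) -> float:
    """Blend twin-critic estimates for one state-action pair.

    In a TD3-style setup, max(q1, q2) tends to overestimate and
    min(q1, q2) tends to underestimate the true action value.
    A weight w in [0, 1] mixes in the overestimated side.
    """
    over, under = max(q1, q2), min(q1, q2)
    return w * over + (1.0 - w) * under


def dynamic_weight(under: float, mean_q: float, cap: float = 0.5) -> float:
    """Hypothetical dynamic weight rule (illustrative only).

    Uses the gap between the mean critic value and the underestimated
    one as a rough proxy for the underestimation degree: the larger
    the gap, the more weight the overestimate receives, capped at `cap`.
    """
    gap = max(mean_q - under, 0.0)
    return min(gap / (abs(mean_q) + 1e-8), cap)
```

With `w = 0` this recovers the standard clipped double-Q (pure min) target, and with `w = 1` the pure max, so the blend interpolates between the two bias regimes.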

