Journal of Computer Applications

Adaptive dual-Critic fusion method for mitigating value estimation bias

LI Shanshan1,2, QIN Jin1,2   

  1. State Key Laboratory of Public Big Data (Guizhou University); 2. College of Computer Science and Technology, Guizhou University
  • Received: 2025-10-13 Revised: 2026-01-07 Online: 2026-03-13 Published: 2026-03-13
  • Corresponding author: QIN Jin
  • About author: LI Shanshan, born in 2000, M.S. candidate, CCF member. Her research interests include reinforcement learning. QIN Jin, born in 1978, Ph.D., associate professor, CCF member. His research interests include computational intelligence and reinforcement learning.
  • Supported by:
    National Natural Science Foundation of China (62162007); Scientific and Technological Projects in Guizhou (KJZY〔2025〕020)

Abstract: Value estimation bias is a critical challenge in model-free off-policy deep reinforcement learning. This bias accumulates throughout training, often leading to suboptimal policies or training divergence. Existing methods typically rely on fixed bias-suppression strategies, which lack the flexibility to adapt to the dynamic characteristics of the bias. To address this issue, an adaptive dual-Critic fusion method for value function target estimation was proposed. Based on the discrepancy between the estimates of two independent Critic networks, the method dynamically weighted the minimum and the average of the two estimates to construct a more robust temporal-difference target. By adaptively adjusting the combination weights according to the inter-network discrepancy, the value estimation bias was mitigated flexibly. Experimental results on five robotic control tasks in the MuJoCo simulation environment show that, compared with baseline algorithms such as Twin Delayed Deep Deterministic Policy Gradient (TD3), Soft Actor-Critic (SAC), Deep Deterministic Policy Gradient (DDPG), and Triplet-Average Deep Deterministic policy gradient (TADD), the proposed method achieves better final performance and stronger training stability on most tasks.
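The fusion rule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the weighting function and the sensitivity parameter `beta` are assumptions introduced here to show how a larger inter-Critic gap can shift the target from the average (TD3's pessimistic choice is the pure minimum) toward the minimum.

```python
import math

def adaptive_fused_target(q1, q2, reward, discount, beta=1.0):
    """Discrepancy-weighted dual-Critic TD target (illustrative sketch).

    q1, q2: target-Critic estimates Q1(s', a') and Q2(s', a').
    beta: hypothetical sensitivity of the weight to the discrepancy;
          the paper's actual weighting rule may differ.
    """
    q_min = min(q1, q2)            # pessimistic estimate (as in TD3)
    q_avg = 0.5 * (q1 + q2)        # average estimate (less pessimistic)
    gap = abs(q1 - q2)             # inter-network discrepancy
    w = 1.0 - math.exp(-beta * gap)  # w in [0, 1): larger gap -> closer to min
    q_fused = w * q_min + (1.0 - w) * q_avg
    return reward + discount * q_fused
```

When the two Critics agree (`gap == 0`), the target reduces to the standard average-based target; as they diverge, the fused value approaches the minimum, which suppresses overestimation exactly when the estimates are least trustworthy.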

Key words: deep reinforcement learning, value function, value estimation bias, dual Critic network, adaptive target value estimation, temporal difference target
