Journal of Computer Applications

Adaptive dual-Critic fusion method for mitigating value estimation bias

LI Shanshan1,2, QIN Jin1,2   

  1. State Key Laboratory of Public Big Data (Guizhou University); 2. College of Computer Science and Technology, Guizhou University
  • Received: 2025-10-13 Revised: 2026-01-07 Online: 2026-03-13 Published: 2026-03-13
  • Corresponding author: QIN Jin
  • About author: LI Shanshan, born in 2000, M.S. candidate, CCF member. Her research interests include reinforcement learning. QIN Jin, born in 1978, Ph.D., associate professor, CCF member. His research interests include computational intelligence and reinforcement learning.
  • Supported by:
    National Natural Science Foundation of China (62162007); Scientific and Technological Projects in Guizhou (KJZY〔2025〕020)

Abstract: Value estimation bias is a critical challenge in model-free off-policy deep reinforcement learning. This bias accumulates throughout training, often leading to suboptimal policies or training divergence. Existing methods typically rely on fixed bias-suppression strategies, which lack the flexibility to adapt to the dynamic characteristics of the bias. To address this issue, an adaptive dual-Critic fusion method for value function target estimation was proposed. Based on the discrepancy between the estimates of two independent Critic networks, the method dynamically weighted the minimum and the average of the two estimates to construct a more robust temporal-difference target. By adaptively adjusting the combination weights according to the inter-network discrepancy, the value estimation bias was mitigated flexibly. Experimental results on five robotic control tasks in the MuJoCo simulation environment show that, compared with baseline algorithms such as Twin Delayed Deep Deterministic Policy Gradient (TD3), Soft Actor-Critic (SAC), Deep Deterministic Policy Gradient (DDPG), and Triplet-Average Deep Deterministic policy gradient (TADD), the proposed method achieves better final performance and stronger training stability on most tasks.
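The fusion rule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the weighting function and the sensitivity parameter `beta` are assumptions introduced here to show how a larger inter-Critic gap can shift the target from the average (TD3's pessimistic choice is the pure minimum) toward the minimum.

```python
import math

def adaptive_fused_target(q1, q2, reward, discount, beta=1.0):
    """Discrepancy-weighted dual-Critic TD target (illustrative sketch).

    q1, q2: target-Critic estimates Q1(s', a') and Q2(s', a').
    beta: hypothetical sensitivity of the weight to the discrepancy;
          the paper's actual weighting rule may differ.
    """
    q_min = min(q1, q2)            # pessimistic estimate (as in TD3)
    q_avg = 0.5 * (q1 + q2)        # average estimate (less pessimistic)
    gap = abs(q1 - q2)             # inter-network discrepancy
    w = 1.0 - math.exp(-beta * gap)  # w in [0, 1): larger gap -> closer to min
    q_fused = w * q_min + (1.0 - w) * q_avg
    return reward + discount * q_fused
```

When the two Critics agree (`gap == 0`), the target reduces to the standard average-based target; as they diverge, the fused value approaches the minimum, which suppresses overestimation exactly when the estimates are least trustworthy.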

Key words: deep reinforcement learning, value function, value estimation bias, dual Critic network, adaptive target value estimation, temporal difference target
