Journal of Computer Applications
XU Tao1,2, HU Bin2, QIN Jin3
Abstract: Maximum entropy reinforcement learning has attracted considerable attention for its exceptional exploration capability in complex tasks. The temperature parameter, which weights the policy's entropy term, is key to balancing exploration and exploitation, and its setting significantly affects algorithm performance. However, existing methods for adjusting the temperature parameter typically rely on empirical presets or on tuning toward a fixed target entropy, neglecting state-dependent differences in exploration and lacking an effective adaptive mechanism. To address this, a state-based adaptive temperature adjustment method is proposed. A neural network model predicts the appropriate temperature parameter for a given state, and normalized temporal difference errors are used to construct supervisory signals that guide the model's training, enabling state-based adaptive adjustment of the entropy term's weight. Combining this adaptive temperature adjustment method with the SAC (Soft Actor-Critic) algorithm yields an SAC algorithm with state-based adaptive temperature adjustment. On MuJoCo standard control tasks, the proposed algorithm outperforms baseline methods in both performance and training stability, validating the effectiveness of the state-based adaptive temperature adjustment method.
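The sketch below illustrates the idea described in the abstract: a small network predicts a per-state temperature for SAC, supervised by targets built from normalized TD errors. It is a minimal illustration, not the authors' released code; the network architecture, the min-max normalization scheme, and the bounds alpha_min and alpha_max are assumptions for the example.

```python
# Minimal sketch (assumed details, not the paper's exact implementation) of a
# state-conditioned temperature network for SAC. States whose batch TD errors
# are larger (less well learned) receive a larger entropy weight.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureNet(nn.Module):
    """Predicts a per-state temperature alpha(s) > 0."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Softplus(),  # keep alpha positive
        )

    def forward(self, state):
        return self.net(state)

def temperature_targets(td_errors, alpha_min=0.01, alpha_max=0.2):
    """Map normalized |TD error| within a batch to a temperature target
    (one assumed normalization scheme)."""
    err = td_errors.abs()
    norm = (err - err.min()) / (err.max() - err.min() + 1e-8)
    return alpha_min + norm * (alpha_max - alpha_min)

# Inside an SAC update step (sketch, variable names are assumptions):
# td_errors  = q_target - q_value             # from the critic update
# alpha_pred = temp_net(states)               # state-dependent temperature
# alpha_loss = F.mse_loss(alpha_pred, temperature_targets(td_errors).detach())
# actor_loss = (alpha_pred.detach() * log_probs - q_min).mean()
```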
Key words: reinforcement learning, maximum entropy theory, temporal difference error, temperature parameter, exploration-exploitation balance, Soft Actor-Critic algorithm
CLC Number: TP181
XU Tao, HU Bin, QIN Jin. Maximum entropy reinforcement learning method with adaptive temperature parameter adjustment[J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2025081006.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2025081006