Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (7): 2082-2090. DOI: 10.11772/j.issn.1001-9081.2022071116

• The 39th CCF National Database Conference of China (NDBC 2022) •

Sparse reward exploration mechanism fusing curiosity and policy distillation

Ziteng WANG1,2, Yaxin YU1,2(), Zifang XIA1,2, Jiaqi QIAO1,2

  1. School of Computer Science and Engineering, Northeastern University, Shenyang Liaoning 110169, China
    2. Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education (Northeastern University), Shenyang Liaoning 110169, China
  • Received: 2022-07-12 Revised: 2022-08-30 Accepted: 2022-09-09 Online: 2023-07-20 Published: 2023-07-10
  • Corresponding author: Yaxin YU
  • About the authors: WANG Ziteng (1998—), male, born in Dalian, Liaoning, M. S. candidate; his main research interests include reinforcement learning and transfer learning.
    YU Yaxin (1971—), female, born in Shenyang, Liaoning, Ph. D., associate professor, CCF member; her main research interests include data mining and social networks.
    XIA Zifang (1998—), female, born in Xingtai, Hebei, M. S. candidate; her main research interests include recommender systems and causal inference.
    QIAO Jiaqi (1998—), female, born in Yichun, Heilongjiang, M. S. candidate; her main research interests include natural language processing and computer vision.
  • Supported by:
    National Natural Science Foundation of China (61871106)

Sparse reward exploration mechanism fusing curiosity and policy distillation

Ziteng WANG1,2, Yaxin YU1,2(), Zifang XIA1,2, Jiaqi QIAO1,2   

  1. School of Computer Science and Engineering, Northeastern University, Shenyang Liaoning 110169, China
    2. Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education (Northeastern University), Shenyang Liaoning 110169, China
  • Received: 2022-07-12 Revised: 2022-08-30 Accepted: 2022-09-09 Online: 2023-07-20 Published: 2023-07-10
  • Contact: Yaxin YU
  • About the authors: WANG Ziteng, born in 1998, M. S. candidate. His research interests include reinforcement learning and transfer learning.
    YU Yaxin, born in 1971, Ph. D., associate professor. Her research interests include data mining and social networks.
    XIA Zifang, born in 1998, M. S. candidate. Her research interests include recommender systems and causal inference.
    QIAO Jiaqi, born in 1998, M. S. candidate. Her research interests include natural language processing and computer vision.
  • Supported by:
    National Natural Science Foundation of China (61871106)

Abstract:

In sparse-reward environments, deep reinforcement learning algorithms struggle to learn an optimal policy through interaction with the environment alone, so intrinsic rewards must be constructed to guide exploration and policy updates. However, several problems remain in this approach: 1) statistical inaccuracy in state classification causes the magnitude of the reward to be misjudged, leading the agent to learn wrong behaviors; 2) because the prediction network becomes highly capable of recognizing state information, the novelty of the states that generate intrinsic rewards decreases, which degrades the learning of the optimal policy; 3) because of random state transitions, the information in teacher policies is not used effectively, which reduces the agent's ability to explore the environment. To address these problems, a reward construction mechanism that fuses the prediction error of a randomly generated network with hash discretization counts, namely RGNP-HCE (Randomly Generated Network Prediction and Hash Count Exploration), was proposed, and the knowledge of multiple teacher policies was transferred to a student policy through distillation. The RGNP-HCE mechanism builds a fused reward following the idea of curiosity classification: a global curiosity reward is constructed across episodes from the prediction error of the randomly generated network, and a local curiosity reward is constructed within a single episode from hash discretization counts, which guarantees the rationality of the intrinsic reward and the correctness of policy-gradient updates. In addition, the knowledge learned by multiple teacher policies is transferred to the student policy through distillation, which effectively improves the student policy's ability to explore the environment. Finally, the proposed mechanism was compared with four current mainstream deep reinforcement learning algorithms in the Montezuma's Revenge and Breakout test environments, and policy distillation was performed. The results show that, compared with current high-performance reinforcement learning algorithms, the RGNP-HCE mechanism improves the average performance in both test environments, and the distilled student policy further improves the average performance, verifying that the RGNP-HCE mechanism and policy distillation are effective in improving the agent's ability to explore the environment.
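The following Python sketch illustrates how a fused intrinsic reward of this kind can be computed: a global curiosity term from the prediction error against a fixed, randomly generated target network, plus a local, per-episode curiosity term from SimHash-style discretization counts. It is a minimal illustration, not the authors' code; the network sizes, the hash width k, and the weights beta_g and beta_l are assumptions made here for readability.

```python
# Minimal sketch (not the authors' code) of a fused intrinsic reward:
# global curiosity from the prediction error against a frozen random network,
# plus local curiosity from per-episode SimHash-style state counts.
# Network sizes, hash width k, and weights beta_g / beta_l are illustrative.
import numpy as np
import torch
import torch.nn as nn

class RandomNetworkCuriosity:
    """Global curiosity: prediction error against a fixed, randomly generated network."""
    def __init__(self, obs_dim, feat_dim=64, lr=1e-4):
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        for p in self.target.parameters():      # the target stays random and frozen
            p.requires_grad_(False)
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def reward_and_update(self, obs):
        obs = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        err = ((self.predictor(obs) - self.target(obs)) ** 2).mean()
        self.opt.zero_grad(); err.backward(); self.opt.step()
        return float(err.detach())              # novel states -> large prediction error

class HashCountCuriosity:
    """Local curiosity: hash discretization counts within one episode."""
    def __init__(self, obs_dim, k=16, seed=0):
        self.A = np.random.default_rng(seed).normal(size=(k, obs_dim))
        self.counts = {}

    def new_episode(self):
        self.counts.clear()                     # counts are reset every episode

    def reward(self, obs):
        code = tuple((self.A @ np.asarray(obs, dtype=np.float64) > 0).astype(int))
        self.counts[code] = self.counts.get(code, 0) + 1
        return 1.0 / np.sqrt(self.counts[code]) # rarely visited codes -> larger bonus

def fused_intrinsic_reward(global_cur, local_cur, obs, beta_g=1.0, beta_l=0.5):
    """Combine the global and local curiosity terms into one intrinsic reward."""
    return beta_g * global_cur.reward_and_update(obs) + beta_l * local_cur.reward(obs)
```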

Key words: reward sparsity, intrinsic reward, exploration ability, policy distillation, deep reinforcement learning

Abstract:

In sparse-reward environments, deep reinforcement learning algorithms have difficulty learning an optimal policy through interaction with the environment, so intrinsic rewards need to be constructed to guide the update of the algorithms. However, some problems remain in this approach: 1) statistical inaccuracy in state classification causes the reward value to be misjudged, so that the agent learns wrong behaviors; 2) because the prediction network becomes highly capable of identifying state information, the novelty of the states that generate intrinsic rewards decreases, which harms the learning of the optimal policy; 3) because of random state transitions, the information of the teacher policies is not used effectively, which reduces the agent's ability to explore the environment. To solve the above problems, a reward construction mechanism combining the prediction error of a randomly generated network with hash discretization counts, namely RGNP-HCE (Randomly Generated Network Prediction and Hash Count Exploration), was proposed, and the knowledge of multiple teacher policies was transferred to a student policy through distillation. In the RGNP-HCE mechanism, a fused reward was constructed following the idea of curiosity classification: a global curiosity reward was constructed across episodes from the prediction error of the randomly generated network, and a local curiosity reward was constructed within a single episode from hash discretization counts, which guarantees the rationality of the intrinsic reward and the correctness of policy-gradient updates. In addition, multi-teacher policy distillation provides the student policy with multiple reference directions for exploration, which effectively improves its ability to explore the environment. Finally, in the Montezuma’s Revenge and Breakout test environments, the proposed mechanism was compared with four current mainstream deep reinforcement learning algorithms, and policy distillation was performed. The results show that, compared with current high-performance deep reinforcement learning algorithms, the RGNP-HCE mechanism achieves higher average performance in both test environments, and the distilled student policy further improves the average performance, indicating that the RGNP-HCE mechanism and policy distillation are effective in improving the exploration ability of the agent.
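As a complement to the reward sketch above, the following Python snippet sketches one plausible form of multi-teacher policy distillation: the student policy minimizes the KL divergence from each teacher's action distribution, averaged over teachers. The temperature, network shapes, and the use of random observations in place of teacher rollouts are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch (an assumption, not the paper's exact procedure) of distilling
# several teacher policies into one student policy via averaged KL divergence.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_policy(obs_dim, n_actions):
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                         nn.Linear(128, n_actions))      # outputs action logits

def distill_step(student, teachers, obs_batch, optimizer, temperature=1.0):
    """One multi-teacher distillation update; returns the scalar loss."""
    student_logp = F.log_softmax(student(obs_batch) / temperature, dim=-1)
    with torch.no_grad():
        teacher_probs = [F.softmax(t(obs_batch) / temperature, dim=-1)
                         for t in teachers]
    loss = 0.0
    for probs in teacher_probs:
        # KL(teacher || student), averaged over the batch
        loss = loss + F.kl_div(student_logp, probs, reduction="batchmean")
    loss = loss / len(teachers)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return float(loss.detach())

# Usage sketch with random observations standing in for teacher rollouts.
obs_dim, n_actions = 8, 4
teachers = [make_policy(obs_dim, n_actions) for _ in range(3)]
student = make_policy(obs_dim, n_actions)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
batch = torch.randn(32, obs_dim)
distill_step(student, teachers, batch, opt)
```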

Key words: reward sparsity, intrinsic reward, exploration ability, policy distillation, deep reinforcement learning

CLC number: