Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (6): 1822-1828. DOI: 10.11772/j.issn.1001-9081.2021040552

• Artificial intelligence •

Intrinsic curiosity method based on reward prediction error

Qing TAN1, Hui LI1,2, Haolin WU1, Zhuang WANG1, Shuchao DENG1

  1. College of Computer Science (College of Software Engineering), Sichuan University, Chengdu Sichuan 610065, China
    2. National Key Laboratory of Fundamental Science on Synthetic Vision (Sichuan University), Chengdu Sichuan 610065, China
  • Received: 2021-04-12; Revised: 2021-06-17; Accepted: 2021-06-23; Online: 2022-06-22; Published: 2022-06-10
  • Contact: Hui LI
  • About author:TAN Qing, born in 1996, M. S. candidate. His research interests include deep reinforcement learning.
    WU Haolin, born in 1990, Ph. D. candidate. His research interests include deep reinforcement learning.
    WANG Zhuang, born in 1987, Ph. D. candidate. His research interests include military artificial intelligence, deep reinforcement learning.
    DENG Shuchao, born in 1999. His research interests include deep reinforcement learning.
  • Supported by:
    Army-Wide Equipment Pre-Research Project (31505550302)

Abstract:

Concerning the problem that a reinforcement learning agent cannot effectively explore the environment when the state prediction error is used directly as the intrinsic curiosity reward in tasks where state novelty is weakly correlated with reward, an Intrinsic Curiosity Module with Reward Prediction Error (RPE-ICM) was proposed. In RPE-ICM, a Reward Prediction Error Network (RPE-Network) was used to learn and correct the state prediction error reward, and the output of the Reward Prediction Error (RPE) model was used as an intrinsic reward signal to balance over-exploration and under-exploration, so that the agent could explore the environment more effectively and exploit the reward to learn skills, thereby achieving a better learning effect. Comparative experiments were conducted on RPE-ICM, the Intrinsic Curiosity Module (ICM), Random Network Distillation (RND) and the traditional Deep Deterministic Policy Gradient (DDPG) algorithm in different MuJoCo (Multi-Joint dynamics with Contact) environments. The results show that compared with traditional DDPG, ICM-DDPG and RND-DDPG, the RPE-ICM-based DDPG algorithm improves average performance by 13.85%, 13.34% and 20.80% respectively in the Hopper environment.
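
As a minimal illustration of the mechanism described above, the sketch below gives one plausible PyTorch reading of RPE-ICM: an ICM-style forward model produces a state prediction error, and a reward-prediction-error network corrects that error into the intrinsic reward mixed with the extrinsic reward for the DDPG update. The network architectures, the choice to regress the corrected signal toward the extrinsic reward, and the mixing coefficient beta are assumptions made for illustration, not the authors' implementation.

# Minimal sketch (PyTorch assumed): ICM-style forward model plus a hypothetical RPE-Network.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    # Predicts the next state from (state, action); its prediction error is the
    # usual ICM curiosity signal.
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class RPENetwork(nn.Module):
    # Hypothetical RPE-Network: maps (state, action, state prediction error) to a
    # corrected intrinsic reward; here it is trained to regress toward the extrinsic
    # reward so that curiosity stays correlated with the task reward (an assumption).
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action, pred_error):
        return self.net(torch.cat([state, action, pred_error], dim=-1))

def curiosity_reward(fwd, rpe, state, action, next_state, ext_reward, beta=0.1):
    # Returns the total reward fed to the DDPG critic plus the two auxiliary losses.
    pred_next = fwd(state, action)
    pred_error = ((pred_next - next_state) ** 2).mean(dim=-1, keepdim=True)  # ICM-style error
    corrected = rpe(state, action, pred_error.detach())                      # RPE correction
    fwd_loss = pred_error.mean()
    rpe_loss = ((corrected - ext_reward) ** 2).mean()
    total_reward = ext_reward + beta * corrected.detach()
    return total_reward, fwd_loss, rpe_loss

In use, fwd_loss and rpe_loss would be minimized alongside the ordinary DDPG actor and critic losses, with total_reward replacing the raw environment reward in the critic target.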

Key words: reinforcement learning, exploration, intrinsic curiosity reward, state novelty, Deep Deterministic Policy Gradient (DDPG)

CLC Number: