Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (6): 1822-1828. DOI: 10.11772/j.issn.1001-9081.2021040552

• Artificial intelligence •

Intrinsic curiosity method based on reward prediction error

Qing TAN1, Hui LI1,2, Haolin WU1, Zhuang WANG1, Shuchao DENG1

  1. College of Computer Science (College of Software Engineering), Sichuan University, Chengdu Sichuan 610065, China
    2. National Key Laboratory of Fundamental Science on Synthetic Vision (Sichuan University), Chengdu Sichuan 610065, China
  • Received: 2021-04-12; Revised: 2021-06-17; Accepted: 2021-06-23; Online: 2022-06-22; Published: 2022-06-10
  • Contact: Hui LI
  • About author:TAN Qing, born in 1996, M. S. candidate. His research interests include deep reinforcement learning.
    WU Haolin, born in 1990, Ph. D. candidate. His research interests include deep reinforcement learning.
    WANG Zhuang, born in 1987, Ph. D. candidate. His research interests include military artificial intelligence, deep reinforcement learning.
    DENG Shuchao, born in 1999. His research interests include deep reinforcement learning.
  • Supported by:
    Army-Wide Equipment Pre-Research Project (31505550302)

Abstract:

Concerning the problem that a reinforcement learning agent cannot effectively explore the environment when the state prediction error is used directly as the intrinsic curiosity reward in tasks where state novelty is weakly correlated with reward, an Intrinsic Curiosity Module with Reward Prediction Error (RPE-ICM) was proposed. In RPE-ICM, a Reward Prediction Error Network (RPE-Network) was used to learn and correct the state prediction error reward, and the output of the Reward Prediction Error (RPE) model was used as an intrinsic reward signal to balance over-exploration and under-exploration, so that the agent could explore the environment more effectively and exploit the reward to learn skills, thereby achieving a better learning effect. Comparative experiments were conducted on RPE-ICM, the Intrinsic Curiosity Module (ICM), Random Network Distillation (RND) and the traditional Deep Deterministic Policy Gradient (DDPG) algorithm in different MuJoCo (Multi-Joint dynamics with Contact) environments. The results show that compared with traditional DDPG, ICM-DDPG and RND-DDPG, the RPE-ICM-based DDPG algorithm improves average performance by 13.85%, 13.34% and 20.80% respectively in the Hopper environment.
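
As a minimal illustration of the mechanism described above, the sketch below gives one plausible PyTorch reading of RPE-ICM: an ICM-style forward model produces a state prediction error, and a reward-prediction-error network corrects that error into the intrinsic reward mixed with the extrinsic reward for the DDPG update. The network architectures, the choice to regress the corrected signal toward the extrinsic reward, and the mixing coefficient beta are assumptions made for illustration, not the authors' implementation.

# Minimal sketch (PyTorch assumed): ICM-style forward model plus a hypothetical RPE-Network.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    # Predicts the next state from (state, action); its prediction error is the
    # usual ICM curiosity signal.
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class RPENetwork(nn.Module):
    # Hypothetical RPE-Network: maps (state, action, state prediction error) to a
    # corrected intrinsic reward; here it is trained to regress toward the extrinsic
    # reward so that curiosity stays correlated with the task reward (an assumption).
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action, pred_error):
        return self.net(torch.cat([state, action, pred_error], dim=-1))

def curiosity_reward(fwd, rpe, state, action, next_state, ext_reward, beta=0.1):
    # Returns the total reward fed to the DDPG critic plus the two auxiliary losses.
    pred_next = fwd(state, action)
    pred_error = ((pred_next - next_state) ** 2).mean(dim=-1, keepdim=True)  # ICM-style error
    corrected = rpe(state, action, pred_error.detach())                      # RPE correction
    fwd_loss = pred_error.mean()
    rpe_loss = ((corrected - ext_reward) ** 2).mean()
    total_reward = ext_reward + beta * corrected.detach()
    return total_reward, fwd_loss, rpe_loss

In use, fwd_loss and rpe_loss would be minimized alongside the ordinary DDPG actor and critic losses, with total_reward replacing the raw environment reward in the critic target.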

Key words: reinforcement learning, exploration, intrinsic curiosity reward, state novelty, Deep Deterministic Policy Gradient (DDPG)

CLC Number: