Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (7): 2082-2090. DOI: 10.11772/j.issn.1001-9081.2022071116
Special Issue: The 39th CCF National Database Conference (NDBC 2022)

Sparse reward exploration mechanism fusing curiosity and policy distillation
Ziteng WANG1,2, Yaxin YU1,2, Zifang XIA1,2, Jiaqi QIAO1,2
Received: 2022-07-12
Revised: 2022-08-30
Accepted: 2022-09-09
Online: 2023-07-20
Published: 2023-07-10
Contact: Yaxin YU
About author: WANG Ziteng, born in 1998 in Dalian, Liaoning, M. S. candidate. His research interests include reinforcement learning and transfer learning.
Supported by:
CLC Number:
Ziteng WANG, Yaxin YU, Zifang XIA, Jiaqi QIAO. Sparse reward exploration mechanism fusing curiosity and policy distillation[J]. Journal of Computer Applications, 2023, 43(7): 2082-2090.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2022071116
| Variable | Description |
| --- | --- |
|  | Output of the RGNP predictor network |
|  | Output of the RGNP target network |
|  | State |
|  | State |
|  | Intrinsic reward constructed by RGNP-HCE |
|  | Extrinsic reward fed back by the environment |
|  | IDF action-prediction network |
|  | Policy network of PPO |
|  | Value network of PPO |
|  | Advantage function estimate |
|  | Importance sampling ratio |
|  | Objective function of the PPO policy network |
|  | Objective function of the PPO value function |
|  | Student policy network of "state-action" distillation |
|  | Student policy network of regularization-factor distillation |
|  | Penalty term of regularization-factor distillation |
|  | Objective function of regularization-factor distillation |
Tab. 1 Variable definitions of RGNP-HCE mechanism
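The PPO quantities in Tab. 1 follow the standard clipped-surrogate formulation, and the predictor/target pair suggests an RND-style curiosity bonus. The paper's exact RGNP-HCE definitions are not reproduced on this page; as a reading aid, the standard forms these entries correspond to are sketched below (the mixing weight β is a generic coefficient, not taken from the paper).

```latex
% Standard forms only, not the paper's exact RGNP-HCE definitions:
% importance sampling ratio, clipped PPO policy objective, PPO value objective,
% and an RND-style intrinsic reward from predictor/target disagreement.
\begin{align}
  r_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \\
  L^{\mathrm{CLIP}}(\theta) &= \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\;
      \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right] \\
  L^{V}(\phi) &= \mathbb{E}_t\!\left[\big(V_\phi(s_t)-\hat{R}_t\big)^{2}\right] \\
  r^{\mathrm{int}}_t &= \big\lVert \hat{f}(s_{t+1}) - f(s_{t+1}) \big\rVert^{2},\qquad
  r_t^{\mathrm{total}} = r^{\mathrm{ext}}_t + \beta\, r^{\mathrm{int}}_t
\end{align}
```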
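As a concrete illustration of the predictor/target idea behind the intrinsic reward, the following PyTorch sketch computes an RND-style curiosity bonus. It is a minimal toy under assumed shapes and names (`RandomEmbedding`, `obs_dim`, the learning rate), not the authors' RGNP-HCE implementation.

```python
import torch
import torch.nn as nn

class RandomEmbedding(nn.Module):
    """Small MLP used both as the fixed random target and the trained predictor."""
    def __init__(self, obs_dim: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim)
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

obs_dim = 8                              # illustrative observation size
target = RandomEmbedding(obs_dim)        # fixed, randomly initialised target network
predictor = RandomEmbedding(obs_dim)     # predictor trained to imitate the target
for p in target.parameters():            # the target is never updated
    p.requires_grad_(False)

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(next_states: torch.Tensor) -> torch.Tensor:
    """Per-state squared prediction error; rarely visited states give larger bonuses."""
    with torch.no_grad():
        tgt = target(next_states)
    return (predictor(next_states) - tgt).pow(2).mean(dim=-1)

# Toy update: training the predictor on visited states shrinks the bonus there,
# so only novel states keep producing large curiosity rewards.
batch = torch.randn(32, obs_dim)
bonus = intrinsic_reward(batch)          # would be added to the extrinsic reward
loss = (predictor(batch) - target(batch).detach()).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```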
| Algorithm | Montezuma's Revenge: mean reward | Montezuma's Revenge: mean rooms | Montezuma's Revenge: max reward | Breakout: mean reward | Breakout: max reward |
| --- | --- | --- | --- | --- | --- |
| DQN | 0.00 | 1 | 0 | 86.75 | 95 |
| PPO | 0.00 | 1 | 0 | 233.44 | 238 |
| RND | 2 566.58 | 4 | 3 500 | 225.76 | 233 |
| RND+AB | 2 665.43 | 6 | 3 800 | 219.09 | 255 |
| RGNP-HCE | 2 699.72 | 6 | 3 300 | 230.39 | 240 |
| Distill | 2 800.68 | 7 | 3 200 | 254.29 | 266 |
| Distill-re | 2 782.14 | 6 | 2 900 | 251.35 | 261 |
Tab. 2 Performance comparison of algorithms
| Algorithm | Improvement over previous row: Montezuma's Revenge /% | Improvement over previous row: Breakout /% |
| --- | --- | --- |
| PPO+RGNP | 1.00 | 1.00 |
| RGNP-HCE | 5.18 | 2.23 |
| Regularization-factor distillation | 3.05 | 8.75 |
| State-action pair distillation | 0.67 | 1.91 |
Tab. 3 Ablation experiment results
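Both distillation variants in the ablation start from the standard policy-distillation objective, in which a student policy is trained to match a teacher's action distribution over visited states; the regularization-factor variant additionally carries a penalty term. A generic form is sketched below; the weight λ and penalty Ω stand in for the paper's regularization factor, whose exact definition is not reproduced on this page.

```latex
% Generic policy-distillation loss for a teacher pi_T and student pi_S^psi;
% Omega and lambda are placeholders for the paper's regularization penalty.
\begin{equation}
  L_{\mathrm{distill}}(\psi) =
  \mathbb{E}_{s \sim \mathcal{D}}\!\left[
    D_{\mathrm{KL}}\!\big(\pi_T(\cdot \mid s)\,\big\Vert\,\pi_S^{\psi}(\cdot \mid s)\big)
  \right] + \lambda\,\Omega(\psi)
\end{equation}
```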