Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (7): 2082-2090. DOI: 10.11772/j.issn.1001-9081.2022071116

• The 39th CCF National Database Conference of China (NDBC 2022) •

Sparse reward exploration mechanism fusing curiosity and policy distillation

Ziteng WANG1,2, Yaxin YU1,2(), Zifang XIA1,2, Jiaqi QIAO1,2

  1. School of Computer Science and Engineering, Northeastern University, Shenyang Liaoning 110169, China
    2. Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education (Northeastern University), Shenyang Liaoning 110169, China
  • Received: 2022-07-12 Revised: 2022-08-30 Accepted: 2022-09-09 Online: 2023-07-20 Published: 2023-07-10
  • Corresponding author: Yaxin YU
  • About the authors: WANG Ziteng (1998—), male, born in Dalian, Liaoning, M. S. candidate; his main research interests include reinforcement learning and transfer learning.
    YU Yaxin (1971—), female, born in Shenyang, Liaoning, Ph. D., associate professor, CCF member; her main research interests include data mining and social networks.
    XIA Zifang (1998—), female, born in Xingtai, Hebei, M. S. candidate; her main research interests include recommender systems and causal inference.
    QIAO Jiaqi (1998—), female, born in Yichun, Heilongjiang, M. S. candidate; her main research interests include natural language processing and computer vision.
  • Supported by:
    National Natural Science Foundation of China (61871106)

Sparse reward exploration mechanism fusing curiosity and policy distillation

Ziteng WANG1,2, Yaxin YU1,2(), Zifang XIA1,2, Jiaqi QIAO1,2   

  1. School of Computer Science and Engineering, Northeastern University, Shenyang Liaoning 110169, China
    2. Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education (Northeastern University), Shenyang Liaoning 110169, China
  • Received: 2022-07-12 Revised: 2022-08-30 Accepted: 2022-09-09 Online: 2023-07-20 Published: 2023-07-10
  • Contact: Yaxin YU
  • About the authors: WANG Ziteng, born in 1998, M. S. candidate. His research interests include reinforcement learning and transfer learning.
    YU Yaxin, born in 1971, Ph. D., associate professor. Her research interests include data mining and social networks.
    XIA Zifang, born in 1998, M. S. candidate. Her research interests include recommender systems and causal inference.
    QIAO Jiaqi, born in 1998, M. S. candidate. Her research interests include natural language processing and computer vision.
  • Supported by:
    National Natural Science Foundation of China (61871106)

Abstract:

In sparse-reward environments, deep reinforcement learning algorithms struggle to learn an optimal policy through interaction with the environment alone, so intrinsic rewards must be constructed to guide exploration and policy updates. However, several problems remain in this approach: 1) statistical inaccuracy in state classification causes the magnitude of the reward to be misjudged, leading the agent to learn wrong behaviors; 2) because the prediction network becomes highly capable of recognizing state information, the novelty of the states that generate intrinsic rewards decreases, which degrades the learning of the optimal policy; 3) because of random state transitions, the information in teacher policies is not used effectively, which reduces the agent's ability to explore the environment. To address these problems, a reward construction mechanism that fuses the prediction error of a randomly generated network with hash discretization counts, namely RGNP-HCE (Randomly Generated Network Prediction and Hash Count Exploration), was proposed, and the knowledge of multiple teacher policies was transferred to a student policy through distillation. The RGNP-HCE mechanism builds a fused reward following the idea of curiosity classification: a global curiosity reward is constructed across episodes from the prediction error of the randomly generated network, and a local curiosity reward is constructed within a single episode from hash discretization counts, which guarantees the rationality of the intrinsic reward and the correctness of policy-gradient updates. In addition, the knowledge learned by multiple teacher policies is transferred to the student policy through distillation, which effectively improves the student policy's ability to explore the environment. Finally, the proposed mechanism was compared with four current mainstream deep reinforcement learning algorithms in the Montezuma's Revenge and Breakout test environments, and policy distillation was performed. The results show that, compared with current high-performance reinforcement learning algorithms, the RGNP-HCE mechanism improves the average performance in both test environments, and the distilled student policy further improves the average performance, verifying that the RGNP-HCE mechanism and policy distillation are effective in improving the agent's ability to explore the environment.
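The following Python sketch illustrates how a fused intrinsic reward of this kind can be computed: a global curiosity term from the prediction error against a fixed, randomly generated target network, plus a local, per-episode curiosity term from SimHash-style discretization counts. It is a minimal illustration, not the authors' code; the network sizes, the hash width k, and the weights beta_g and beta_l are assumptions made here for readability.

```python
# Minimal sketch (not the authors' code) of a fused intrinsic reward:
# global curiosity from the prediction error against a frozen random network,
# plus local curiosity from per-episode SimHash-style state counts.
# Network sizes, hash width k, and weights beta_g / beta_l are illustrative.
import numpy as np
import torch
import torch.nn as nn

class RandomNetworkCuriosity:
    """Global curiosity: prediction error against a fixed, randomly generated network."""
    def __init__(self, obs_dim, feat_dim=64, lr=1e-4):
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        for p in self.target.parameters():      # the target stays random and frozen
            p.requires_grad_(False)
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def reward_and_update(self, obs):
        obs = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        err = ((self.predictor(obs) - self.target(obs)) ** 2).mean()
        self.opt.zero_grad(); err.backward(); self.opt.step()
        return float(err.detach())              # novel states -> large prediction error

class HashCountCuriosity:
    """Local curiosity: hash discretization counts within one episode."""
    def __init__(self, obs_dim, k=16, seed=0):
        self.A = np.random.default_rng(seed).normal(size=(k, obs_dim))
        self.counts = {}

    def new_episode(self):
        self.counts.clear()                     # counts are reset every episode

    def reward(self, obs):
        code = tuple((self.A @ np.asarray(obs, dtype=np.float64) > 0).astype(int))
        self.counts[code] = self.counts.get(code, 0) + 1
        return 1.0 / np.sqrt(self.counts[code]) # rarely visited codes -> larger bonus

def fused_intrinsic_reward(global_cur, local_cur, obs, beta_g=1.0, beta_l=0.5):
    """Combine the global and local curiosity terms into one intrinsic reward."""
    return beta_g * global_cur.reward_and_update(obs) + beta_l * local_cur.reward(obs)
```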

Key words: reward sparsity, intrinsic reward, exploration ability, policy distillation, deep reinforcement learning

Abstract:

In sparse-reward environments, deep reinforcement learning algorithms have difficulty learning an optimal policy through interaction with the environment, so intrinsic rewards need to be constructed to guide the update of the algorithms. However, some problems remain in this approach: 1) statistical inaccuracy in state classification causes the reward value to be misjudged, so that the agent learns wrong behaviors; 2) because the prediction network becomes highly capable of identifying state information, the novelty of the states that generate intrinsic rewards decreases, which harms the learning of the optimal policy; 3) because of random state transitions, the information of the teacher policies is not used effectively, which reduces the agent's ability to explore the environment. To solve the above problems, a reward construction mechanism combining the prediction error of a randomly generated network with hash discretization counts, namely RGNP-HCE (Randomly Generated Network Prediction and Hash Count Exploration), was proposed, and the knowledge of multiple teacher policies was transferred to a student policy through distillation. In the RGNP-HCE mechanism, a fused reward was constructed following the idea of curiosity classification: a global curiosity reward was constructed across episodes from the prediction error of the randomly generated network, and a local curiosity reward was constructed within a single episode from hash discretization counts, which guarantees the rationality of the intrinsic reward and the correctness of policy-gradient updates. In addition, multi-teacher policy distillation provides the student policy with multiple reference directions for exploration, which effectively improves its ability to explore the environment. Finally, in the Montezuma’s Revenge and Breakout test environments, the proposed mechanism was compared with four current mainstream deep reinforcement learning algorithms, and policy distillation was performed. The results show that, compared with current high-performance deep reinforcement learning algorithms, the RGNP-HCE mechanism achieves higher average performance in both test environments, and the distilled student policy further improves the average performance, indicating that the RGNP-HCE mechanism and policy distillation are effective in improving the exploration ability of the agent.
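As a complement to the reward sketch above, the following Python snippet sketches one plausible form of multi-teacher policy distillation: the student policy minimizes the KL divergence from each teacher's action distribution, averaged over teachers. The temperature, network shapes, and the use of random observations in place of teacher rollouts are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch (an assumption, not the paper's exact procedure) of distilling
# several teacher policies into one student policy via averaged KL divergence.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_policy(obs_dim, n_actions):
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                         nn.Linear(128, n_actions))      # outputs action logits

def distill_step(student, teachers, obs_batch, optimizer, temperature=1.0):
    """One multi-teacher distillation update; returns the scalar loss."""
    student_logp = F.log_softmax(student(obs_batch) / temperature, dim=-1)
    with torch.no_grad():
        teacher_probs = [F.softmax(t(obs_batch) / temperature, dim=-1)
                         for t in teachers]
    loss = 0.0
    for probs in teacher_probs:
        # KL(teacher || student), averaged over the batch
        loss = loss + F.kl_div(student_logp, probs, reduction="batchmean")
    loss = loss / len(teachers)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return float(loss.detach())

# Usage sketch with random observations standing in for teacher rollouts.
obs_dim, n_actions = 8, 4
teachers = [make_policy(obs_dim, n_actions) for _ in range(3)]
student = make_policy(obs_dim, n_actions)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
batch = torch.randn(32, obs_dim)
distill_step(student, teachers, batch, opt)
```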

Key words: reward sparsity, intrinsic reward, exploration ability, policy distillation, deep reinforcement learning

CLC number: