[1] YANG R, YAN J P, LI X. Survey of sparse reward algorithms in reinforcement learning — theory and experiment[J]. CAAI Transactions on Intelligent Systems, 2020, 15(5): 888-899 (in Chinese). DOI: 10.11992/tis.202003031.
[2] LI B, YUE K Q, GAN Z G, et al. Multi-UAV cooperative autonomous navigation based on multi-agent deep deterministic policy gradient[J]. Journal of Astronautics, 2021, 42(6): 757-765 (in Chinese). DOI: 10.3873/j.issn.1000-1328.2021.06.009.
[3] YE D H, CHEN G B, ZHANG W, et al. Towards playing full MOBA games with deep reinforcement learning[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2020: 621-632.
[4] LI Y X. Deep reinforcement learning: an overview[EB/OL]. (2018-11-26) [2021-10-11].
[5] BADIA A P, SPRECHMANN P, VITVITSKYI A, et al. Never give up: learning directed exploration strategies[EB/OL]. (2020-02-14) [2021-11-05].
[6] PATHAK D, AGRAWAL P, EFROS A A, et al. Curiosity-driven exploration by self-supervised prediction[C]//Proceedings of the 34th International Conference on Machine Learning. New York: JMLR.org, 2017: 2778-2787.
[7] OUDEYER P Y, KAPLAN F. How can we define intrinsic motivation?[C/OL]//Proceedings of the 8th International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems [2021-11-05].
[8] STREHL A L, LITTMAN M L. An analysis of model-based Interval Estimation for Markov Decision Processes[J]. Journal of Computer and System Sciences, 2008, 74(8): 1309-1331. DOI: 10.1016/j.jcss.2007.08.009.
[9] LAI T L, ROBBINS H. Asymptotically efficient adaptive allocation rules[J]. Advances in Applied Mathematics, 1985, 6(1): 4-22. DOI: 10.1016/0196-8858(85)90002-8.
[10] OSTROVSKI G, BELLEMARE M G, VAN DEN OORD A, et al. Count-based exploration with neural density models[C]//Proceedings of the 34th International Conference on Machine Learning. New York: JMLR.org, 2017: 2721-2730.
[11] BURDA Y, EDWARDS H, STORKEY A, et al. Exploration by random network distillation[EB/OL]. (2018-10-30) [2021-12-18].
[12] TANG H R, HOUTHOOFT R, FOOTE D, et al. #Exploration: a study of count-based exploration for deep reinforcement learning[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 2750-2759.
[13] PARISOTTO E, BA J, SALAKHUTDINOV R. Actor-mimic: deep multitask and transfer reinforcement learning[EB/OL]. (2016-02-22) [2020-11-09].
[14] RUSU A A, COLMENAREJO S G, GÜLÇEHRE Ç, et al. Policy distillation[EB/OL]. (2016-01-07) [2020-09-07].
[15] JIANG Y B, LIU Q, HU Z H. Actor-critic algorithm with maximum-entropy correction[J]. Chinese Journal of Computers, 2020, 43(10): 1897-1908 (in Chinese). DOI: 10.11897/SP.J.1016.2020.01897.
[16] SUTTON R S, BARTO A G. Reinforcement learning: an introduction[M]. Cambridge: MIT Press, 1998: 75-76.
[17] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533. DOI: 10.1038/nature14236.
[18] WILLIAMS R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992, 8(3/4): 229-256. DOI: 10.1007/BF00992696.
[19] KONDA V R, TSITSIKLIS J N. Actor-critic algorithms[C]//Proceedings of the 12th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2000: 1008-1014.
[20] MNIH V, BADIA A P, MIRZA M, et al. Asynchronous methods for deep reinforcement learning[C]//Proceedings of the 33rd International Conference on Machine Learning. New York: JMLR.org, 2016: 1928-1937.
[21] SCHULMAN J, LEVINE S, MORITZ P, et al. Trust region policy optimization[C]//Proceedings of the 32nd International Conference on Machine Learning. New York: JMLR.org, 2015: 1889-1897.
[22] SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms[EB/OL]. (2017-08-28) [2021-09-29].
[23] THOMPSON W R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples[J]. Biometrika, 1933, 25(3/4): 285-294. DOI: 10.1093/biomet/25.3-4.285.
[24] HAARNOJA T, TANG H R, ABBEEL P, et al. Reinforcement learning with deep energy-based policies[C]//Proceedings of the 34th International Conference on Machine Learning. New York: JMLR.org, 2017: 1352-1361.
[25] OSBAND I, BLUNDELL C, PRITZEL A, et al. Deep exploration via bootstrapped DQN[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2016: 4033-4041.
[26] BELLEMARE M G, SRINIVASAN S, OSTROVSKI G, et al. Unifying count-based exploration and intrinsic motivation[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2016: 1479-1487.
[27] STADIE B C, LEVINE S, ABBEEL P. Incentivizing exploration in reinforcement learning with deep predictive models[EB/OL]. (2015-11-19) [2020-12-18].
[28] BURDA Y, EDWARDS H, PATHAK D, et al. Large-scale study of curiosity-driven learning[EB/OL]. (2018-08-13) [2022-01-08].
[29] SONG Y, CHEN Y F, HU Y J, et al. Exploring unknown states with action balance[C]//Proceedings of the 2020 IEEE Conference on Games. Piscataway: IEEE, 2020: 184-191. DOI: 10.1109/CoG47356.2020.9231562.
[30] HINTON G, VINYALS O, DEAN J. Distilling the knowledge in a neural network[EB/OL]. (2015-03-09) [2020-12-19].
[31] CZARNECKI W M, PASCANU R, OSINDERO S, et al. Distilling policy distillation[C]//Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics. New York: JMLR.org, 2019: 1331-1340.