Q-table initialization approach for safe exploration based on factorization machine


Journal of Computer Applications, 2022, Vol. 42, Issue (1): 209-214. DOI: 10.11772/j.issn.1001-9081.2021020239

Advanced computing


Bosen ZENG^1,2,3, Yong ZHONG^1,2, Xianhua NIU^4,5

^1.Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu Sichuan 610041, China
^2.University of Chinese Academy of Sciences, Beijing 100049, China
^3.School of Network and Communication Engineering, Chengdu Technological University, Chengdu Sichuan 611730, China
^4.National Key Laboratory of Science and Technology on Communications (University of Electronic Science and Technology of China), Chengdu Sichuan 611731, China
^5.School of Computer and Software Engineering, Xihua University, Chengdu Sichuan 610039, China

Received:2021-02-09 Revised:2021-04-21 Accepted:2021-04-28 Online:2021-05-14 Published:2022-01-10
Contact: Bosen ZENG
About author: ZENG Bosen, born in 1982 in Dazhou, Sichuan, Ph. D. candidate, senior engineer. His research interests include machine learning and wireless communications.
ZHONG Yong, born in 1966 in Yuechi, Sichuan, Ph. D., research fellow. His research interests include big data and its intelligent processing, cloud computing, and software engineering.
NIU Xianhua, born in 1983 in Xinxiang, Henan, Ph. D., professor. Her research interests include intelligent information processing and information security.
Supported by:
China Postdoctoral Science Foundation (2019M663475)

Abstract:

Most exploration/exploitation strategies in reinforcement learning ignore the risk introduced when the agent selects actions with a random component during exploration. To address this problem, a Q-table initialization approach based on Factorization Machine (FM) was proposed for safe exploration. Firstly, the Q-values already explored in the Q-table were introduced as prior knowledge. Then, FM was used to model the latent interactions between states and actions in the prior knowledge. Finally, the unknown Q-values in the Q-table were predicted with this model to further guide the agent's exploration. In A/B testing conducted in Cliffwalk, a grid-world reinforcement learning environment of OpenAI Gym, the numbers of bad exploration episodes of the Boltzmann and Upper Confidence Bound (UCB) exploration/exploitation strategies based on the proposed approach were reduced by 68.12% and 89.98% respectively. Experimental results show that the proposed approach improves the exploration safety of the traditional strategies and accelerates convergence at the same time.

Key words: reinforcement learning, Q-learning, Factorization Machine (FM), Q-table initialization, safe exploration
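
The three steps described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each state and each action is one-hot encoded, so that a second-order FM prediction reduces to q(s, a) ≈ w0 + w_s + w_a + ⟨v_s, v_a⟩, and it fits the model with plain stochastic gradient descent. The function names `fit_fm` and `init_q_table`, the learning rate, the regularization strength and the epoch count are hypothetical; only the factorization dimension k = 2 is taken from the experimental parameters.

```python
import numpy as np


def fit_fm(triples, n_states, n_actions, k=2, lr=0.05, reg=0.01,
           epochs=200, seed=0):
    """Fit a second-order FM on (state, action, q_value) triples by SGD.

    Assumption: states and actions are one-hot encoded, so each sample
    activates exactly one state feature and one action feature.
    """
    rng = np.random.default_rng(seed)
    w0 = 0.0                                  # global bias
    ws = np.zeros(n_states)                   # linear weights for states
    wa = np.zeros(n_actions)                  # linear weights for actions
    vs = rng.normal(0.0, 0.1, (n_states, k))  # latent factors for states
    va = rng.normal(0.0, 0.1, (n_actions, k)) # latent factors for actions
    triples = list(triples)
    for _ in range(epochs):
        for idx in rng.permutation(len(triples)):
            s, a, q = triples[idx]
            # Prediction error of the FM on this prior Q-value.
            err = w0 + ws[s] + wa[a] + vs[s] @ va[a] - q
            # Compute both factor gradients before updating either one.
            grad_vs = err * va[a] + reg * vs[s]
            grad_va = err * vs[s] + reg * va[a]
            w0 -= lr * err
            ws[s] -= lr * (err + reg * ws[s])
            wa[a] -= lr * (err + reg * wa[a])
            vs[s] -= lr * grad_vs
            va[a] -= lr * grad_va
    return w0, ws, wa, vs, va


def init_q_table(triples, n_states, n_actions, **fm_kwargs):
    """Build an initial Q-table: FM predictions for unknown entries,
    the explored (prior) Q-values kept unchanged for known entries."""
    triples = list(triples)
    w0, ws, wa, vs, va = fit_fm(triples, n_states, n_actions, **fm_kwargs)
    # Predict every (state, action) pair at once.
    q = w0 + ws[:, None] + wa[None, :] + vs @ va.T
    # Assumption: explored values take precedence over predictions.
    for s, a, value in triples:
        q[s, a] = value
    return q
```

For example, `init_q_table([(0, 0, -1.0), (0, 1, -2.0)], n_states=3, n_actions=2)` returns a 3×2 array in which the two explored entries keep their given values and the remaining four are FM predictions. Keeping the explored values rather than overwriting them with predictions is a design choice of this sketch.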

CLC Number:

TP18

Bosen ZENG, Yong ZHONG, Xianhua NIU. Q-table initialization approach for safe exploration based on factorization machine[J]. Journal of Computer Applications, 2022, 42(1): 209-214.


Figures/Tables 8

Fig. 1 Unknown Q-value prediction based on prior Q-values

Fig. 2 Example of input data

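Fig. 3 Cliffwalk environment

Tab. 1 Experimental parameters

Parameter	Value
Number of prior Q-value triples	20, 40, 60, 80, 100
Default Q-value	0
Number of experiment runs	10
Number of episodes	500
Q-learning learning rate α	0.1
Q-learning discount rate γ	0.9
ε-greedy ε	0.1
Factorization dimension k	2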
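
To show where the learning rate, discount rate and ε from the experimental parameters enter, the following is a minimal sketch of a standard tabular Q-learning step with ε-greedy and Boltzmann action selection. It is illustrative only: the function names are hypothetical, and the Boltzmann `temperature` is an assumption, since no temperature value is listed in the parameters.

```python
import numpy as np

# Values taken from the experimental parameters table.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1


def epsilon_greedy(q_row, rng, epsilon=EPSILON):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))


def boltzmann(q_row, rng, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature).

    The temperature default is an assumption, not a value from the source.
    """
    logits = (q_row - np.max(q_row)) / temperature  # shift for stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q_row), p=probs))


def q_learning_update(q, s, a, reward, s_next, done,
                      alpha=ALPHA, gamma=GAMMA):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    # Terminal transitions do not bootstrap from the next state.
    target = reward if done else reward + gamma * np.max(q[s_next])
    q[s, a] += alpha * (target - q[s, a])
    return q[s, a]
```

With a zero-initialized entry Q(s, a), a reward of -1 and a best next-state value of 2, the update gives 0.1 × (-1 + 0.9 × 2) = 0.08. An initial table that already carries informative values changes which action each selector prefers from the very first episode.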

Fig. 4 Exploration safety comparison of different strategies

Fig. 5 Comparison of convergence speed of Boltzmann strategy

Fig. 6 Comparison of convergence speed of UCB strategy

Fig. 7 Comparison of convergence speed of ε-greedy strategy

