Q-table initialization approach for safe exploration based on factorization machine

doi:10.11772/j.issn.1001-9081.2021020239

Journal of Computer Applications, 2022, Vol. 42, Issue (1): 209-214.

• Advanced computing •

Bosen ZENG¹,²,³; Yong ZHONG¹,²; Xianhua NIU⁴,⁵

1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, Sichuan 610041, China
2. University of Chinese Academy of Sciences, Beijing 100049, China
3. School of Network and Communication Engineering, Chengdu Technological University, Chengdu, Sichuan 611730, China
4. National Key Laboratory of Science and Technology on Communications (University of Electronic Science and Technology of China), Chengdu, Sichuan 611731, China
5. School of Computer and Software Engineering, Xihua University, Chengdu, Sichuan 610039, China

Received: 2021-02-09 Revised: 2021-04-21 Accepted: 2021-04-28 Online: 2021-05-14 Published: 2022-01-10
Contact: Bosen ZENG
About the authors: ZENG Bosen, born in 1982 in Dazhou, Sichuan, Ph.D. candidate, senior engineer. His research interests include machine learning and wireless communications.
ZHONG Yong, born in 1966 in Yuechi, Sichuan, Ph.D., research fellow. His research interests include big data and its intelligent processing, cloud computing, and software engineering.
NIU Xianhua, born in 1983 in Xinxiang, Henan, Ph.D., professor. Her research interests include intelligent information processing and information security.
Supported by:
China Postdoctoral Science Foundation (2019M663475)

Abstract:

Most exploration/exploitation strategies in reinforcement learning ignore the risk introduced when the agent selects actions at random during exploration. To address this problem, a Q-table initialization approach based on Factorization Machine (FM) was proposed for safe exploration. Firstly, the Q-values already explored in the Q-table were introduced as prior knowledge. Then, FM was used to model the latent interactions between states and actions in this prior knowledge. Finally, the unknown Q-values in the Q-table were predicted with this model to further guide the agent's exploration. A/B testing was conducted in Cliffwalk, a grid reinforcement learning environment of OpenAI Gym. With the proposed approach, the number of bad exploration episodes of the Boltzmann and Upper Confidence Bound (UCB) exploration/exploitation strategies decreased by 68.12% and 89.98% respectively. Experimental results show that the proposed approach improves the exploration safety of the traditional strategies and accelerates convergence at the same time.

Key words: reinforcement learning, Q-learning, Factorization Machine (FM), Q-table initialization, safe exploration
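The three steps described in the abstract can be sketched in code. The following is a minimal illustration, not the authors' implementation. It assumes each (state, action) pair is encoded as two concatenated one-hot vectors, so the FM prediction reduces to a global bias, a state weight, an action weight and the inner product of a state factor and an action factor. The function names `fit_fm` and `init_q_table`, the plain SGD loop, the learning rate, the regularization strength, the target scaling and the example triples are all assumptions made for illustration; the factorization dimension k = 2 is the value listed in the experimental parameters, and the 48 × 4 table size assumes the standard 4 × 12 Cliffwalk grid with four actions.

```python
import numpy as np

def fit_fm(triples, n_states, n_actions, k=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit a second-order FM on one-hot (state, action) inputs with plain SGD.

    For a one-hot pair the FM output is w0 + w_s[s] + w_a[a] + <v_s[s], v_a[a]>.
    Hyperparameters other than k are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    w0 = 0.0
    w_s, w_a = np.zeros(n_states), np.zeros(n_actions)
    v_s = rng.normal(0.0, 0.1, size=(n_states, k))
    v_a = rng.normal(0.0, 0.1, size=(n_actions, k))
    for _ in range(epochs):
        for s, a, target in triples:
            err = (w0 + w_s[s] + w_a[a] + v_s[s] @ v_a[a]) - target
            w0 -= lr * err
            w_s[s] -= lr * (err + reg * w_s[s])
            w_a[a] -= lr * (err + reg * w_a[a])
            # Compute both factor gradients before updating either factor.
            grad_vs = err * v_a[a] + reg * v_s[s]
            grad_va = err * v_s[s] + reg * v_a[a]
            v_s[s] -= lr * grad_vs
            v_a[a] -= lr * grad_va
    return w0, w_s, w_a, v_s, v_a

def init_q_table(triples, n_states, n_actions, **fm_kwargs):
    """Predict every Q-value with the FM, then restore the explored entries."""
    # Scale targets to [-1, 1] so the SGD step size stays stable (an assumption).
    scale = max(abs(q) for _, _, q in triples) or 1.0
    scaled = [(s, a, q / scale) for s, a, q in triples]
    w0, w_s, w_a, v_s, v_a = fit_fm(scaled, n_states, n_actions, **fm_kwargs)
    q_table = scale * (w0 + w_s[:, None] + w_a[None, :] + v_s @ v_a.T)
    for s, a, q in triples:
        q_table[s, a] = q  # prior knowledge is kept unchanged
    return q_table

# Illustrative prior triples (state, action, Q-value); the numbers are made up.
prior = [(36, 1, -100.0), (36, 0, -1.0), (24, 2, -2.0), (25, 2, -90.0)]
q_table = init_q_table(prior, n_states=48, n_actions=4)
```

Restoring the explored entries after prediction keeps the prior knowledge exact, so the FM only fills in the entries that would otherwise hold the default Q-value.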

CLC Number: TP18

Bosen ZENG, Yong ZHONG, Xianhua NIU. Q-table initialization approach for safe exploration based on factorization machine[J]. Journal of Computer Applications, 2022, 42(1): 209-214.

Figures and tables: 8

Fig. 1 Unknown Q-value prediction based on prior Q-values

Fig. 2 Example of input data

Fig. 3 Cliffwalk environment

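Tab. 1 Experimental parameters

Parameter	Value
Number of prior Q-value triples	20, 40, 60, 80, 100
Default Q-value	0
Number of experimental runs	10
Number of episodes	500
Q-learning learning rate α	0.1
Q-learning discount rate γ	0.9
ε-greedy ε	0.1
Factorization dimension k	2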
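For orientation, the tabular Q-learning update and ε-greedy action selection that the learning rate, discount rate and ε in Table 1 refer to can be written as follows. This is the textbook form of the algorithm, given as a sketch rather than the code used in the experiments; the function names and the 48 × 4 table in the usage lines are illustrative assumptions, and only the values of α, γ and ε come from the table.

```python
import numpy as np

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # values listed in Table 1

def epsilon_greedy(q_table, state, rng, epsilon=EPSILON):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))
    return int(np.argmax(q_table[state]))

def q_learning_update(q_table, s, a, reward, s_next, done, alpha=ALPHA, gamma=GAMMA):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = reward if done else reward + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (target - q_table[s, a])
    return q_table[s, a]

# Usage with the default Q-value 0 from Table 1 (table size is an assumption).
q = np.zeros((48, 4))
q_learning_update(q, s=36, a=1, reward=-100.0, s_next=36, done=False)
```

With every entry initialized to the default value 0, a penalized action is only discouraged after it has been tried at least once, which is the situation the FM-based initialization is meant to improve.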

Fig. 4 Exploration safety comparison of different strategies

Fig. 5 Comparison of convergence speed of Boltzmann strategy

Fig. 6 Comparison of convergence speed of UCB strategy

Fig. 7 Comparison of convergence speed of ε-greedy strategy
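The Boltzmann and UCB strategies compared in these figures both choose actions from one row of the Q-table, so the initial Q-values directly shape early exploration. The sketch below shows common textbook forms of the two selection rules. The exact formulations used in the experiments are not given here, so the temperature, the exploration constant `c`, the visit-count handling and the function names are all assumptions.

```python
import numpy as np

def boltzmann_action(q_row, rng, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    z = (q_row - np.max(q_row)) / temperature  # shift by the max for numerical stability
    probs = np.exp(z)
    probs /= probs.sum()
    return int(rng.choice(len(q_row), p=probs)), probs

def ucb_action(q_row, counts, t, c=2.0):
    """Pick argmax of Q + c * sqrt(ln t / N); unvisited actions are tried first."""
    counts = np.asarray(counts, dtype=float)
    bonus = c * np.sqrt(np.log(max(t, 1)) / np.maximum(counts, 1e-9))
    bonus[counts == 0] = np.inf
    return int(np.argmax(np.asarray(q_row, dtype=float) + bonus))

# An action whose Q-value is strongly negative is almost never sampled.
rng = np.random.default_rng(0)
action, probs = boltzmann_action(np.array([0.0, -100.0, 0.0, 0.0]), rng)
```

Under either rule, an unknown entry that is predicted to be strongly negative is selected far less often than one left at the default value 0, which is how a better initialization can reduce risky exploratory actions.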


