Recommendation model of penetration path based on reinforcement learning

Journal of Computer Applications, 2022, Vol. 42, Issue (6): 1689-1694. DOI: 10.11772/j.issn.1001-9081.2021061424

Paper from the 2021 National Annual Conference on Open Distributed and Parallel Computing (DPCS 2021)

Haini ZHAO¹,², Jian JIAO¹,²

1. Computer School, Beijing Information Science and Technology University, Beijing 100101, China
2. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research (Beijing Information Science and Technology University), Beijing 100101, China

Received: 2021-08-09 Revised: 2021-10-16 Accepted: 2021-10-29 Online: 2022-01-10 Published: 2022-06-10
Corresponding author: Jian JIAO
About the author: ZHAO Haini, born in 1997 in Fuyang, Anhui, M.S. candidate, CCF member. Her research interests include network security and penetration testing.
Supported by: Opening Project of Beijing Key Laboratory of Internet Culture and Digital Dissemination Research (ICDDXN006)

Abstract

The core problem of penetration testing is planning the penetration path. Manual planning relies on the experience of testers, while automated generation of penetration paths relies mainly on prior knowledge of network security and on specific vulnerabilities or network scenarios, which makes it costly and inflexible. To address these problems, a reinforcement learning-based penetration path recommendation model named Q Learning Penetration Test (QLPT) is proposed. Through multiple rounds of vulnerability selection and reward feedback, QLPT gives the optimal penetration path for the penetration target. Penetration experiments on an open-source cyber range show that the paths recommended by QLPT are highly consistent with the paths of manual penetration testing, which verifies the feasibility and accuracy of the model. Compared with the automated penetration testing framework Metasploit, QLPT is also more flexible in adapting to all penetration scenarios.

Key words: penetration test, reinforcement learning, Q learning, strategic planning

CLC number: TP393.08

Haini ZHAO, Jian JIAO. Recommendation model of penetration path based on reinforcement learning[J]. Journal of Computer Applications, 2022, 42(6): 1689-1694.

Figures and tables (12)

Fig. 1 Reinforcement learning process

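Tab. 1 Main symbols and their definitions

Symbol	Meaning
V	Set of exploitable vulnerabilities
$v_i$	Vulnerability i, described as (vulnerability identifier, vulnerability location, vulnerability level, data)
S	Set of states describing how much the model knows about the penetration target
$s_t$	State at time t, described as (location, privilege, data)
$s'$	Expected state
A	Action set
$a_t$	Action executed at time t, described as $(v_i, T)$
T	Vulnerability exploitation tool
Resu	Vulnerability exploitation result
R	Reward value, in {0, 1}
∂	Random selection probability
π	Policy (decision rule)
γ	Discount factor
$Q^{\pi}(s_t, v_i)$	Value of exploiting vulnerability $v_i$ in state $s_t$ under policy $\pi$
$Q(s_t, v_i)$	Value of exploiting vulnerability $v_i$ in state $s_t$ under the optimal policy
MaxEpisodes	Maximum number of learning episodes
Path	Penetration path, a sequence of vulnerability exploitation actions, i.e. $Path = a_1, a_2, \ldots, a_n$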
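
The value $Q(s_t, v_i)$ in Tab. 1 is the quantity that tabular Q-learning estimates. As a reference, the standard Q-learning update can be written with the symbols of Tab. 1 as sketched below. The learning rate $\alpha$ and the next state $s_{t+1}$ do not appear in Tab. 1 and are introduced here as assumptions; this page does not show the exact form used in the paper.

```latex
\[
Q(s_t, v_i) \leftarrow Q(s_t, v_i)
  + \alpha \Bigl[ R + \gamma \max_{v_j \in V} Q(s_{t+1}, v_j) - Q(s_t, v_i) \Bigr]
\]
```

With a discount factor $\gamma$ close to 1, a reward obtained at the end of a path still propagates back to the earlier exploitation steps over repeated episodes.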

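Tab. 2 System access rights description

Symbol	Meaning
root	System administrator; manages all resources, including system devices, system files and system processes
user	Any ordinary system user, created at system initialization or by the system administrator, with its own private resources
access	Remote visitor who can access network services, usually a trusted visitor; can exchange data with network service processes and scan system information
none	Remote visitor without any privileges, including untrusted users and users isolated outside the firewall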
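
The four access rights in Tab. 2 can be modeled as an ordered enumeration, which makes it easy to check whether an exploitation step raised the attacker's privilege. This is only a minimal sketch: the numeric ordering (none < access < user < root) is an assumption inferred from the descriptions in the table, and the names `Privilege` and `is_escalation` are hypothetical, not taken from the paper.

```python
from enum import IntEnum


class Privilege(IntEnum):
    """Access rights from Tab. 2; the numeric ordering is an assumption."""
    NONE = 0    # remote visitor without any privileges
    ACCESS = 1  # remote visitor who can reach network services
    USER = 2    # ordinary system user with private resources
    ROOT = 3    # system administrator, manages all resources


def is_escalation(before: Privilege, after: Privilege) -> bool:
    """Return True when a step strictly raises the privilege level."""
    return after > before


if __name__ == "__main__":
    # Moving from a remote visitor to an ordinary user counts as escalation.
    print(is_escalation(Privilege.ACCESS, Privilege.USER))
```

Using `IntEnum` keeps the comparison operators available, so privilege checks read naturally without a separate lookup table.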

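Tab. 3 Function description of QLPT internal modules

QLPT module	Function description
Vulnerability selection	Selects the best vulnerability $v_i$ to exploit according to the current state $s_t$
Vulnerability exploitation	Selects a tool T according to the current vulnerability $v_i$ and uses T to exploit $v_i$, yielding the exploitation result Resu
Reward conversion	Converts the exploitation result Resu into a reward value R according to the current state $s_t$ and the expected state $s'$
Q-learning	Updates the corresponding $Q(s_t, v_i)$ value in the Q table according to the state $s_t$, the vulnerability $v_i$ and the reward value R
State-vulnerability Q table	Stores the value $Q(s_t, v_i)$ of each state-vulnerability pair
State update	Updates the state $s_t$ according to the exploitation result Resu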
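
The six modules in Tab. 3 map naturally onto a tabular Q-learning loop. The sketch below is a hedged illustration, not the authors' implementation: the toy transition table, the vulnerability names `v_a`, `v_b`, `v_c`, the learning rate `alpha`, and the rule that the reward is 1 only when the expected state is reached are all assumptions made for this example. Exploitation is simulated by a dictionary lookup rather than by a real tool T.

```python
import random
from collections import defaultdict

# Hypothetical toy environment (NOT from the paper): maps
# (state, vulnerability) to the next state when the exploit succeeds.
TOY_TRANSITIONS = {(0, "v_a"): 1, (1, "v_b"): 2, (2, "v_c"): 3}
VULNS = ["v_a", "v_b", "v_c"]
START_STATE, EXPECTED_STATE = 0, 3


def select_vulnerability(q, state, explore_prob, rng):
    """Vulnerability selection: random with explore_prob, else greedy."""
    if rng.random() < explore_prob:
        return rng.choice(VULNS)
    best = max(q[(state, v)] for v in VULNS)
    return rng.choice([v for v in VULNS if q[(state, v)] == best])


def exploit(state, vuln):
    """Simulated exploitation: Resu is the next state, or None on failure."""
    return TOY_TRANSITIONS.get((state, vuln))


def to_reward(resu, expected_state):
    """Reward conversion (assumed rule): 1 if the expected state is reached."""
    return 1 if resu == expected_state else 0


def train(max_episodes=200, explore_prob=0.2, gamma=0.9, alpha=0.5,
          max_steps=50, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # state-vulnerability Q table
    for _ in range(max_episodes):
        state = START_STATE
        for _ in range(max_steps):
            vuln = select_vulnerability(q, state, explore_prob, rng)
            resu = exploit(state, vuln)
            next_state = state if resu is None else resu  # state update
            reward = to_reward(resu, EXPECTED_STATE)
            if next_state == EXPECTED_STATE:
                best_next = 0.0  # terminal state has no future value
            else:
                best_next = max(q[(next_state, v)] for v in VULNS)
            # Q-learning update of the visited state-vulnerability pair
            td_target = reward + gamma * best_next
            q[(state, vuln)] += alpha * (td_target - q[(state, vuln)])
            state = next_state
            if state == EXPECTED_STATE:
                break
    return q


def recommend_path(q, max_steps=10):
    """Greedy read-out of a path from the learned Q table."""
    state, path = START_STATE, []
    while state != EXPECTED_STATE and len(path) < max_steps:
        vuln = max(VULNS, key=lambda v: q[(state, v)])
        resu = exploit(state, vuln)
        if resu is None:
            break
        path.append(vuln)
        state = resu
    return path


if __name__ == "__main__":
    print(recommend_path(train()))
```

Here `explore_prob` plays the role of the random selection probability ∂ and `max_episodes` the role of MaxEpisodes from Tab. 1. Keeping exploitation behind a single `exploit` function means the simulated lookup could be swapped for a real tool invocation without touching the learning loop.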

Fig. 2 QLPT training flowchart

Fig. 3 Experimental scenario

Tab. 4 Vulnerability information description

Vulnerability identifier	Vulnerability type	Vulnerability location	Vulnerability level	Data
CWE-20	File upload	Input file	7.5	webshell
CWE-22	Directory scanning	./	5.3	Sensitive files
CWE-89	SQL injection	ss_id	10.0	webshell
CWE-307	Brute force	login	5.3	Password list
CWE-79	XSS	user_email	5.3	Cookie

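Tab. 5 State set indexes

Index	State	Location	Privilege	Data
0	Exploitable vulnerability set obtained	web	access	Vulnerability information
1	Back-end entry obtained	web	user	Back-end URL
2	User information obtained	web	user	User information
3	Webshell obtained	web	user	Administrator information
4	Privilege escalation succeeded	server	root	All information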
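
The rows of Tab. 4 and Tab. 5 follow the tuple forms given in Tab. 1, (identifier, location, level, data) for a vulnerability and (location, privilege, data) for a state, so they can be encoded directly as small immutable records. The following is a minimal sketch under that reading; the class and field names (`Vulnerability`, `State`, `by_level`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vulnerability:
    """One row of Tab. 4."""
    ident: str
    vuln_type: str
    location: str
    level: float
    data: str


@dataclass(frozen=True)
class State:
    """One row of Tab. 5, with a human-readable label."""
    label: str
    location: str
    privilege: str
    data: str


VULNERABILITIES = [
    Vulnerability("CWE-20", "File upload", "Input file", 7.5, "webshell"),
    Vulnerability("CWE-22", "Directory scanning", "./", 5.3, "Sensitive files"),
    Vulnerability("CWE-89", "SQL injection", "ss_id", 10.0, "webshell"),
    Vulnerability("CWE-307", "Brute force", "login", 5.3, "Password list"),
    Vulnerability("CWE-79", "XSS", "user_email", 5.3, "Cookie"),
]

STATES = {
    0: State("Exploitable vulnerability set obtained", "web", "access",
             "Vulnerability information"),
    1: State("Back-end entry obtained", "web", "user", "Back-end URL"),
    2: State("User information obtained", "web", "user", "User information"),
    3: State("Webshell obtained", "web", "user", "Administrator information"),
    4: State("Privilege escalation succeeded", "server", "root",
             "All information"),
}


def by_level(vulns):
    """Return vulnerabilities sorted from highest to lowest level."""
    return sorted(vulns, key=lambda v: v.level, reverse=True)


if __name__ == "__main__":
    for vuln in by_level(VULNERABILITIES):
        print(vuln.ident, vuln.level)
```

Frozen dataclasses are hashable, so these records can serve directly as keys of a state-vulnerability Q table.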

Fig. 4 Penetration path

Fig. 5 Expected profit from vulnerability exploitation varying with number of learning rounds

Tab. 6 Comparison of QLPT and Metasploit

Framework/model	Generality across attack scenarios	Vulnerability exploitation conditions updated by feedback	Testers need no vulnerability exploitation knowledge

Fig. 6 Number of vulnerability exploitation attempts needed to reach the penetration target in each learning round

