Recommendation model of penetration path based on reinforcement learning

doi:10.11772/j.issn.1001-9081.2021061424

Abstract

Path planning is the core problem of penetration testing. Manual planning relies on the experience of testers, while automated generation of penetration paths depends mainly on prior knowledge of network security and on specific vulnerabilities or network scenarios, which is costly and inflexible. To address these problems, a reinforcement learning-based penetration path recommendation model named Q Learning Penetration Test (QLPT) was proposed. Through multiple rounds of vulnerability selection and reward feedback, QLPT gives the optimal penetration path for the penetration target. Penetration experiments on an open-source cyber range show that the path recommended by QLPT is highly consistent with the path of a manual penetration test, which verifies the feasibility and accuracy of the model; compared with the automated penetration testing framework Metasploit, QLPT is more flexible in adapting to all penetration scenarios.

Key words: penetration test, reinforcement learning, Q learning, strategic planning

CLC Number: TP393.08

Haini ZHAO, Jian JIAO. Recommendation model of penetration path based on reinforcement learning[J]. Journal of Computer Applications, 2022, 42(6): 1689-1694.

Figures/Tables 12

Fig. 1 Reinforcement learning process

Tab. 1 Main symbols and their definitions

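Symbol	Meaning
V	Set of exploitable vulnerabilities
$v_i$	Vulnerability $i$, described as (vulnerability identifier, vulnerability location, vulnerability level, data)
S	Set of states describing how much the model knows about the penetration target
$s_t$	State at time $t$, described as (location, privilege, data)
$s'$	Expected state
A	Action set
$a_t$	Action executed at time $t$, described as $(v_i, T)$
T	Vulnerability exploitation tool
Resu	Vulnerability exploitation result
R	Reward value, in {0, 1}
∂	Random selection probability
π	Policy (decision rule)
γ	Discount factor
$Q^{π}(s_t, v_i)$	Value of exploiting vulnerability $v_i$ in state $s_t$ under policy $π$
$Q(s_t, v_i)$	Value of exploiting vulnerability $v_i$ in state $s_t$ under the optimal policy
MaxEpisodes	Maximum number of learning episodes
Path	Penetration path, a sequence of vulnerability exploitation actions, i.e. $Path = a_1, a_2, …, a_n$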
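
For orientation, the standard tabular Q-learning update can be written with the symbols of Tab. 1. This is a sketch of the textbook rule, not a formula quoted from the paper: the learning rate $α$ and the next state $s_{t+1}$ do not appear in the table and are assumptions introduced here.

$$
Q(s_t, v_i) \leftarrow Q(s_t, v_i) + α \left[ R + γ \max_{v_j \in V} Q(s_{t+1}, v_j) - Q(s_t, v_i) \right]
$$

Under this rule, a reward $R = 1$ raises the value of the exploited vulnerability directly, while $R = 0$ still propagates value backwards from $s_{t+1}$ through the discount factor $γ$.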

Tab. 2 System access rights description

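Symbol	Meaning
root	System administrator, who manages all resources such as system devices, system files and system processes
user	Any ordinary system user, created at system initialization or by the system administrator, with independent private resources
access	Remote visitor who can access network services, usually a trusted visitor, and who can exchange data with network service processes, scan system information, and so on
none	Remote visitor without any privileges, including users who are untrusted or isolated outside the firewall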
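
One way to make the four access rights of Tab. 2 comparable in code is an ordered enumeration. The sketch below is illustrative only: the class name `Privilege`, the helper `is_escalation`, and the numeric ordering none < access < user < root are assumptions inferred from the descriptions, not definitions taken from the paper.

```python
from enum import IntEnum


class Privilege(IntEnum):
    """Access rights of Tab. 2; the numeric order is an assumption."""
    NONE = 0    # remote visitor without any privileges
    ACCESS = 1  # remote visitor who can reach network services
    USER = 2    # ordinary system user with private resources
    ROOT = 3    # system administrator


def is_escalation(before: Privilege, after: Privilege) -> bool:
    """Return True when an exploit raised the privilege level."""
    return after > before
```

For example, `is_escalation(Privilege.ACCESS, Privilege.USER)` is true, whereas moving from `ROOT` to `USER` is not an escalation.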

Tab. 3 Function description of QLPT internal modules

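QLPT module	Function description
Vulnerability selection	Selects the best vulnerability $v_i$ to exploit according to the current state $s_t$
Vulnerability exploitation	Selects tool T according to the current vulnerability $v_i$ and uses tool T to exploit $v_i$, obtaining the exploitation result Resu
Reward conversion	Converts the exploitation result Resu into reward value R according to the current state $s_t$ and the expected state $s'$
Q-learning	Updates the corresponding $Q(s_t, v_i)$ value in the Q-table according to state $s_t$, vulnerability $v_i$ and reward value R
State-vulnerability Q-table	Stores the value $Q(s_t, v_i)$ of each state-vulnerability pair
State update	Updates state $s_t$ according to the exploitation result Resu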
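
The interplay of the modules in Tab. 3 can be sketched as a small tabular Q-learning loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `TRANSITIONS` map is a hypothetical toy environment, exploitation is simulated rather than performed with a real tool, and the learning rate `alpha`, the step limit `max_steps` and all function names are introduced here for illustration.

```python
import random

N_STATES, N_VULNS = 5, 5
GOAL = 4  # expected state s' (assumption: the last state index)

# Hypothetical toy transitions: (state, vulnerability) -> next state.
# Any pair not listed is a failed exploit and leaves the state unchanged.
TRANSITIONS = {(0, 1): 1, (1, 3): 2, (2, 0): 3, (3, 2): 4}


def select_vulnerability(q, s, explore_prob, rng):
    """Vulnerability selection: random with probability explore_prob, else greedy."""
    if rng.random() < explore_prob:
        return rng.randrange(N_VULNS)
    return max(range(N_VULNS), key=lambda v: q[s][v])


def exploit(s, v):
    """Vulnerability exploitation (simulated): the result Resu is the state reached."""
    return TRANSITIONS.get((s, v), s)


def to_reward(resu, goal):
    """Reward conversion: 1 if the result reaches the expected state, else 0."""
    return 1 if resu == goal else 0


def train(max_episodes=500, alpha=0.5, gamma=0.9, explore_prob=0.3,
          max_steps=100, seed=0):
    rng = random.Random(seed)
    # State-vulnerability Q-table, initialised to zero (assumption).
    q = [[0.0] * N_VULNS for _ in range(N_STATES)]
    for _ in range(max_episodes):
        s = 0
        for _ in range(max_steps):
            v = select_vulnerability(q, s, explore_prob, rng)
            resu = exploit(s, v)
            r = to_reward(resu, GOAL)
            # Q-learning: move Q(s, v) toward r + gamma * max Q(next state, .)
            q[s][v] += alpha * (r + gamma * max(q[resu]) - q[s][v])
            s = resu  # state update
            if s == GOAL:
                break
    return q


def recommend_path(q, max_steps=10):
    """Follow the greedy policy from state 0 and return the chosen vulnerability indexes."""
    s, path = 0, []
    while s != GOAL and len(path) < max_steps:
        v = max(range(N_VULNS), key=lambda i: q[s][i])
        path.append(v)
        s = exploit(s, v)
    return path
```

Calling `recommend_path(train())` returns the greedy sequence of vulnerability indexes; with the toy transitions above it is expected to follow the single chain that leads from state 0 to the goal state.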

Fig. 2 QLPT training flowchart

Fig. 3 Experimental scenario

Tab. 4 Vulnerability information description

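Vulnerability identifier	Vulnerability type	Vulnerability location	Vulnerability level	Data
CWE-20	File upload	Input file	7.5	webshell
CWE-22	Directory scanning	./	5.3	Sensitive files
CWE-89	SQL injection	ss_id	10.0	webshell
CWE-307	Brute force	login	5.3	Password list
CWE-79	XSS	user_email	5.3	Cookie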

Tab. 5 State set indexes

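Index	State	Location	Privilege	Data
0	Exploitable vulnerability set obtained	web	access	Vulnerability information
1	Back-end entry obtained	web	user	Back-end URL
2	User information obtained	web	user	User information
3	webshell obtained	web	user	Administrator information
4	Privilege escalation succeeded	server	root	All information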
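
The vulnerability and state tuples of Tab. 4 and Tab. 5 map naturally onto small record types, from which an empty state-vulnerability Q-table can be sized. The sketch below is an assumed encoding: the class and field names are hypothetical, the table values are copied in English, and initialising every Q-value to zero is an assumption rather than something stated in the source.

```python
from typing import NamedTuple


class Vulnerability(NamedTuple):
    identifier: str  # vulnerability identifier (CWE number)
    vuln_type: str   # vulnerability type
    location: str    # where the vulnerability sits in the target
    level: float     # vulnerability level as listed in Tab. 4
    data: str        # what a successful exploit yields


class State(NamedTuple):
    description: str
    location: str
    privilege: str
    data: str


VULNS = [
    Vulnerability("CWE-20", "file upload", "Input file", 7.5, "webshell"),
    Vulnerability("CWE-22", "directory scanning", "./", 5.3, "sensitive files"),
    Vulnerability("CWE-89", "SQL injection", "ss_id", 10.0, "webshell"),
    Vulnerability("CWE-307", "brute force", "login", 5.3, "password list"),
    Vulnerability("CWE-79", "XSS", "user_email", 5.3, "Cookie"),
]

# The list position of each state is its index in Tab. 5.
STATES = [
    State("exploitable vulnerability set obtained", "web", "access", "vulnerability information"),
    State("back-end entry obtained", "web", "user", "back-end URL"),
    State("user information obtained", "web", "user", "user information"),
    State("webshell obtained", "web", "user", "administrator information"),
    State("privilege escalation succeeded", "server", "root", "all information"),
]

# One row per state, one column per vulnerability, all zeros (assumption).
q_table = [[0.0 for _ in VULNS] for _ in STATES]
```

With five states and five vulnerabilities the table holds 25 state-vulnerability values, small enough to store and update directly without function approximation.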

Fig. 4 Penetration path

Fig. 5 Expected profit from vulnerability exploitation varying with number of learning rounds

Tab. 6 Comparison of QLPT and Metasploit

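Framework/model	Generality across attack scenarios	Feedback-based update of vulnerability exploitation conditions	Testers need no vulnerability exploitation knowledge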

Fig. 6 Vulnerability exploration attempt number to reach penetration target per learning round
