Recommendation model of penetration path based on reinforcement learning

doi:10.11772/j.issn.1001-9081.2021061424

Journal of Computer Applications, 2022, Vol. 42, Issue (6): 1689-1694.

Special topic: Papers of the 2021 National Academic Annual Conference on Open Distributed and Parallel Computing (DPCS 2021)

Haini ZHAO¹,², Jian JIAO¹,²

1. Computer School, Beijing Information Science and Technology University, Beijing 100101, China
2. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research (Beijing Information Science and Technology University), Beijing 100101, China

Received: 2021-08-09  Revised: 2021-10-16  Accepted: 2021-10-29  Online: 2022-01-10  Published: 2022-06-10
Contact: Jian JIAO
About author: ZHAO Haini (born 1997), female, from Fuyang, Anhui, M.S. candidate, CCF member. Her research interests include network security and penetration testing.
Supported by: Opening Project of Beijing Key Laboratory of Internet Culture and Digital Dissemination Research (ICDDXN006)

Abstract

The core problem of penetration testing is planning the penetration path. Manual planning relies on the experience of the testers, while automated generation of penetration paths mainly depends on prior knowledge of network security and on specific vulnerabilities or network scenarios, which is costly and inflexible. To address these problems, a reinforcement learning-based penetration path recommendation model named Q Learning Penetration Test (QLPT) is proposed, which gives the optimal penetration path for the penetration target through multiple rounds of vulnerability selection and reward feedback. Penetration experiments on an open-source cyber range show that the paths recommended by QLPT are highly consistent with those of manual penetration testing, verifying the feasibility and accuracy of the model; compared with the automated penetration testing framework Metasploit, QLPT is also more flexible in adapting to all penetration scenarios.

Key words: penetration test, reinforcement learning, Q learning, strategic planning

CLC Number: TP393.08

Haini ZHAO, Jian JIAO. Recommendation model of penetration path based on reinforcement learning[J]. Journal of Computer Applications, 2022, 42(6): 1689-1694.

Figures/Tables 12

Fig. 1 Reinforcement learning process

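Tab. 1 Main symbols and their definitions

Symbol	Meaning
V	Set of exploitable vulnerabilities
$v_i$	Vulnerability $i$, described as (vulnerability identifier, vulnerability location, vulnerability level, data)
S	Set of states describing how much the model knows about the penetration target
$s_t$	State at time $t$, described as (location, privilege, data)
$s'$	Expected state
A	Action set
$a_t$	Action executed at time $t$, described as $(v_i, T)$
T	Vulnerability exploitation tool
Resu	Vulnerability exploitation result
R	Reward value, in {0, 1}
∂	Random selection probability
π	Policy (decision rule)
γ	Discount factor
$Q^{\pi}(s_t, v_i)$	Value of exploiting vulnerability $v_i$ in state $s_t$ under policy $\pi$
$Q(s_t, v_i)$	Value of exploiting vulnerability $v_i$ in state $s_t$ under the optimal policy
MaxEpisodes	Maximum number of learning episodes
Path	Penetration path, a sequence of vulnerability exploitation actions, i.e. $Path = a_1, a_2, \dots, a_n$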
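
The symbols of Tab. 1 fit the standard tabular Q-learning update. The following is a hedged sketch, not necessarily the exact rule used by QLPT: the learning rate $\alpha$ and the successor state $s_{t+1}$ do not appear in Tab. 1 and are assumed here.

```latex
% Assumed: learning rate \alpha and successor state s_{t+1} (not listed in Tab. 1)
Q(s_t, v_i) \leftarrow Q(s_t, v_i)
  + \alpha \left[ R + \gamma \max_{v_j \in V} Q(s_{t+1}, v_j) - Q(s_t, v_i) \right]

% Selection with random selection probability \partial:
v_i =
\begin{cases}
  \text{a uniformly random element of } V, & \text{with probability } \partial \\
  \arg\max_{v_j \in V} Q(s_t, v_j),        & \text{with probability } 1 - \partial
\end{cases}
```

With $R \in \{0, 1\}$ and $0 \le \gamma < 1$, every $Q$ value learned this way stays within $[0, 1/(1-\gamma)]$.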

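Tab. 2 System access rights description

Symbol	Meaning
root	System administrator, who manages all resources such as system devices, system files, and system processes
user	Any ordinary system user, created at system initialization or by the system administrator, with independent private resources
access	Remote visitor who can access network services, usually a trusted visitor who can exchange data with network service processes, scan system information, etc.
none	Remote visitor without any privileges, including untrusted users and users isolated outside the firewall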
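
The four access rights in Tab. 2 can be modelled as an ordered enumeration so that a privilege gain is a simple comparison. This is a minimal sketch: the numeric ordering `none < access < user < root` and the helper `is_escalation` are assumptions made for illustration, not definitions taken from the paper.

```python
from enum import IntEnum


class Privilege(IntEnum):
    """Access rights from Tab. 2; the numeric ordering is an assumption."""
    NONE = 0    # remote visitor without any privileges
    ACCESS = 1  # remote visitor who can reach network services
    USER = 2    # ordinary system user
    ROOT = 3    # system administrator


def is_escalation(before: Privilege, after: Privilege) -> bool:
    """Hypothetical helper: True if an exploit raised the privilege level."""
    return after > before
```

For example, `is_escalation(Privilege.ACCESS, Privilege.USER)` evaluates to `True`, while moving from `ROOT` to `USER` does not count as an escalation.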

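Tab. 3 Function description of QLPT internal modules

QLPT module	Function description
Vulnerability selection	Selects the best vulnerability $v_i$ to exploit according to the current state $s_t$
Vulnerability exploitation	Selects tool T according to the current vulnerability $v_i$ and uses T to exploit $v_i$, obtaining the exploitation result Resu
Reward conversion	Converts the exploitation result Resu into the reward value R according to the current state $s_t$ and the expected state $s'$
Q-learning	Updates the corresponding $Q(s_t, v_i)$ value in the Q-table according to state $s_t$, vulnerability $v_i$, and reward R
State-vulnerability Q-table	Stores the value $Q(s_t, v_i)$ of each state-vulnerability pair
State update	Updates the state $s_t$ according to the exploitation result Resu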
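
One way the six modules of Tab. 3 could be wired together is sketched below. It is an illustration under stated assumptions, not the authors' implementation: the toy target (`TRANSITIONS`, `VULNS`, `START`, `GOAL`), the stubbed `exploit` function, the learning rate `alpha`, and `max_steps` are all hypothetical, and real exploitation with a tool T is replaced by a table lookup.

```python
import random
from collections import defaultdict

# Hypothetical toy target: (state, vulnerability) -> state reached on success.
TRANSITIONS = {(0, "scan"): 1, (1, "brute"): 2, (2, "upload"): 3}
VULNS = ["scan", "brute", "upload"]
START, GOAL = 0, 3  # GOAL plays the role of the expected state s'


def select_vulnerability(q, state, explore_prob, rng):
    """Vulnerability selection: random with probability explore_prob, else greedy."""
    if rng.random() < explore_prob:
        return rng.choice(VULNS)
    return max(VULNS, key=lambda v: q[(state, v)])


def exploit(state, vuln):
    """Vulnerability exploitation (stub): Resu is the reached state, or None on failure."""
    return TRANSITIONS.get((state, vuln))


def to_reward(resu, goal):
    """Reward conversion: 1 if Resu reaches the expected state, otherwise 0."""
    return 1 if resu == goal else 0


def update_state(state, resu):
    """State update: advance on success, stay in place on failure."""
    return state if resu is None else resu


def q_learning_update(q, state, vuln, reward, next_state, alpha, gamma):
    """Q-learning: one tabular update of Q(s_t, v_i)."""
    best_next = max(q[(next_state, v)] for v in VULNS)
    q[(state, vuln)] += alpha * (reward + gamma * best_next - q[(state, vuln)])


def train(max_episodes=200, explore_prob=0.2, alpha=0.5, gamma=0.9,
          max_steps=50, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # state-vulnerability Q-table, initialised to 0
    for _ in range(max_episodes):
        state = START
        for _ in range(max_steps):
            vuln = select_vulnerability(q, state, explore_prob, rng)
            resu = exploit(state, vuln)
            reward = to_reward(resu, GOAL)
            next_state = update_state(state, resu)
            q_learning_update(q, state, vuln, reward, next_state, alpha, gamma)
            state = next_state
            if state == GOAL:
                break
    return q


def recommend_path(q, max_len=10):
    """Read the greedy path out of the learned Q-table."""
    path, state = [], START
    while state != GOAL and len(path) < max_len:
        vuln = max(VULNS, key=lambda v: q[(state, v)])
        path.append(vuln)
        state = update_state(state, exploit(state, vuln))
    return path
```

Calling `recommend_path(train())` reads off the greedy sequence of exploits from the start state. Keeping each module as its own function mirrors the table, so the stubbed `exploit` could later be swapped for a real tool invocation without touching the learning code.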

Fig. 2 QLPT training flowchart

Fig. 3 Experimental scenario

Tab. 4 Vulnerability information description

Vulnerability identifier	Vulnerability type	Vulnerability location	Vulnerability level	Data
CWE-20	File upload	Input file	7.5	webshell
CWE-22	Directory scanning	./	5.3	Sensitive files
CWE-89	SQL injection	ss_id	10.0	webshell
CWE-307	Brute force	login	5.3	Password list
CWE-79	XSS	user_email	5.3	Cookie

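Tab. 5 State set indexes

Index	State	Location	Privilege	Data
0	Exploitable vulnerability set obtained	web	access	Vulnerability information
1	Back-end entry obtained	web	user	Back-end URL
2	User information obtained	web	user	User information
3	webshell obtained	web	user	Administrator information
4	Privilege escalation succeeded	server	root	All information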
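
The contents of Tab. 4 and Tab. 5 can be encoded as plain records from which an empty state-vulnerability Q-table is built. The sketch below is illustrative only: the class names, field names, English labels, and the helper `empty_q_table` are assumptions, and the paper does not specify how its Q-table is stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vulnerability:
    ident: str      # vulnerability identifier
    kind: str       # vulnerability type
    location: str   # vulnerability location
    level: float    # vulnerability level
    data: str       # data obtained by exploiting it


@dataclass(frozen=True)
class State:
    description: str
    location: str
    privilege: str
    data: str


# Rows of Tab. 4
VULNERABILITIES = [
    Vulnerability("CWE-20", "file upload", "Input file", 7.5, "webshell"),
    Vulnerability("CWE-22", "directory scanning", "./", 5.3, "sensitive files"),
    Vulnerability("CWE-89", "SQL injection", "ss_id", 10.0, "webshell"),
    Vulnerability("CWE-307", "brute force", "login", 5.3, "password list"),
    Vulnerability("CWE-79", "XSS", "user_email", 5.3, "Cookie"),
]

# Rows of Tab. 5, keyed by state index
STATES = {
    0: State("exploitable vulnerability set obtained", "web", "access",
             "vulnerability information"),
    1: State("back-end entry obtained", "web", "user", "back-end URL"),
    2: State("user information obtained", "web", "user", "user information"),
    3: State("webshell obtained", "web", "user", "administrator information"),
    4: State("privilege escalation succeeded", "server", "root", "all information"),
}


def empty_q_table(states, vulns):
    """Hypothetical helper: one zero-initialised entry per state-vulnerability pair."""
    return {(index, v.ident): 0.0 for index in states for v in vulns}
```

With five states and five vulnerabilities, `empty_q_table(STATES, VULNERABILITIES)` holds 25 entries, one for each $Q(s_t, v_i)$.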

Fig. 4 Penetration path

Fig. 5 Expected return of vulnerability exploitation versus number of learning rounds

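Tab. 6 Comparison of QLPT and Metasploit

Framework/model	Generality across attack scenarios	Feedback-based update of vulnerability exploitation conditions	Testers need no vulnerability exploitation knowledge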

Fig. 6 Number of vulnerability exploitation attempts needed to reach the penetration target in each learning round
