Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (12): 3831-3840. DOI: 10.11772/j.issn.1001-9081.2021101730

• Cyber security •

PDF document detection model based on system calls and data provenance

Jingwei LEI, Peng YI, Xiang CHEN, Liang WANG, Ming MAO

  1. PLA Strategic Support Force Information Engineering University, Zhengzhou, Henan 450001, China
  • Received: 2021-10-09 Revised: 2022-01-24 Accepted: 2022-02-21 Online: 2022-04-18 Published: 2022-12-10
  • Contact: Jingwei LEI
  • About the authors: YI Peng, born in 1977 in Zhengzhou, Henan, Ph. D., research fellow. His research interests include intrusion detection and new network architectures.
    CHEN Xiang, born in 1990 in Jingzhou, Hubei, Ph. D. candidate, research assistant, CCF member. His research interests include anomaly detection and intrusion tolerance.
    WANG Liang, born in 1995 in Tanghe, Henan, M. S. candidate, assistant engineer. His research interests include network active defense and electronic countermeasures.
    MAO Ming, born in 1987 in Tianshui, Gansu, Ph. D. candidate. His research interests include network security and security for the internet of vehicles.
  • Supported by:
    Program of National Defense Science and Technology Innovation Special Zone

Abstract:

Traditional static and dynamic detection methods cannot cope with malicious PDF document attacks that rely on heavy obfuscation and unknown techniques. To address this, a new detection model based on system calls and data provenance, called NtProvenancer, is proposed. First, a system call tracing tool collects the system call records generated while the document is executed. Next, data provenance technology is used to build a provenance graph from those system calls. Finally, the key point algorithm of the graph extracts feature segments of system calls for detection. The experimental dataset consists of 528 benign and 320 malicious PDF documents. Tests were carried out on Adobe Reader, and for comparison the key point algorithm was replaced with Term Frequency-Inverse Document Frequency (TF-IDF) and with the rarity algorithm of PROVDETECTOR. The results show that NtProvenancer outperforms the comparison models on multiple metrics, including precision and F1 score. Under the optimal parameter setting, the average time per document is 251.51 ms in the training stage and 60.55 ms in the detection stage, the false alarm rate is below 5.22%, and the F1 score reaches 0.989, showing that NtProvenancer is an efficient and practical model for PDF document detection.

Key words: PDF document detection, system call, data provenance, key point algorithm, feature segment
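
The three steps named in the abstract (collect system call records, build a provenance graph, extract system call segments) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `(process, syscall, target)` record format, the sample records, and the function names are hypothetical, and the final filter is a simple stand-in rather than NtProvenancer's key point algorithm, which the abstract does not specify.

```python
from collections import defaultdict

# Hypothetical records: (acting process, system call, target object).
# The values are made-up sample data, not output of a real tracing tool.
records = [
    ("AcroRd32.exe", "NtCreateFile", "C:\\tmp\\a.pdf"),
    ("AcroRd32.exe", "NtReadFile", "C:\\tmp\\a.pdf"),
    ("AcroRd32.exe", "NtCreateUserProcess", "cmd.exe"),
    ("cmd.exe", "NtWriteFile", "C:\\tmp\\payload.exe"),
]


def build_provenance_graph(records):
    """Map each source node to its outgoing (syscall, destination) edges."""
    graph = defaultdict(list)
    for process, syscall, target in records:
        graph[process].append((syscall, target))
    return graph


def syscall_paths(graph, start):
    """List the syscall label sequences along simple paths from start."""
    paths = []

    def walk(node, labels, visited):
        # Only follow edges to nodes not yet on the current path.
        edges = [(s, d) for s, d in graph.get(node, []) if d not in visited]
        if not edges:
            if labels:
                paths.append(tuple(labels))
            return
        for syscall, dest in edges:
            walk(dest, labels + [syscall], visited | {dest})

    walk(start, [], {start})
    return paths


def unseen_segments(paths, benign_paths):
    """Stand-in filter: keep segments never observed in benign runs."""
    benign = set(benign_paths)
    return [p for p in paths if p not in benign]


graph = build_provenance_graph(records)
paths = syscall_paths(graph, "AcroRd32.exe")
suspicious = unseen_segments(paths, [("NtCreateFile",), ("NtReadFile",)])
print(suspicious)
```

In this toy trace the only segment not covered by the benign set is the one where the reader spawns a process that then writes a file. A real detector would need a scoring step far more selective than this set difference.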

CLC Number: