Journal of Computer Applications, 2023, Vol. 43, Issue (8): 2517-2526. DOI: 10.11772/j.issn.1001-9081.2022071135

• Computer Software Technology •

Source code vulnerability detection based on hybrid code representation

Kun ZHANG, Fengyu YANG, Fa ZHONG, Guangdong ZENG, Shijian ZHOU

  1. School of Software, Nanchang Hangkong University, Nanchang, Jiangxi 330063, China
  • Received: 2022-07-31 Revised: 2022-11-07 Accepted: 2022-11-07 Online: 2023-01-15 Published: 2023-08-10
  • Contact: Fengyu YANG
  • About the authors: ZHANG Kun, born in 1998, from Xinyu, Jiangxi, M.S. candidate, CCF member. His research interests include source code vulnerability detection.
    ZHONG Fa, born in 1999, from Yichun, Jiangxi, M.S. candidate. His research interests include software defect prediction.
    ZENG Guangdong, born in 1998, from Ganzhou, Jiangxi, M.S. candidate. His research interests include software defect prediction.
    ZHOU Shijian, born in 1966, from Ji'an, Jiangxi, Ph.D., professor, CCF member. His research interests include intelligent systems.
  • Supported by:
    Natural Science Foundation of Jiangxi Province (20212BAB212009)

Abstract:

Software vulnerabilities pose a serious threat to network and information security, and their root cause lies in software source code. Existing traditional static detection tools and deep learning based detection methods represent code features incompletely and convert code representations with simple word embedding, which leads to low detection accuracy and a high false positive rate or a high false negative rate. To address incomplete code representation and improve detection performance, a source code vulnerability detection method based on hybrid code representation was proposed. Firstly, the source code was compiled into Intermediate Representation (IR), and the program dependence graph was extracted. Then, structured features were obtained by program slicing based on data flow and control flow analysis, while unstructured features were obtained by embedding node statements with doc2vec. Next, a Graph Neural Network (GNN) was used to learn the hybrid features. Finally, the trained GNN was used for prediction and classification. To verify the effectiveness of the proposed method, experiments were conducted on the Software Assurance Reference Dataset (SARD) and real-world datasets, where the F1 scores of the detection results reached 95.3% and 89.6% respectively. The experimental results show that the proposed method detects vulnerabilities effectively on both kinds of data.

Key words: vulnerability detection, Intermediate Representation (IR), representation learning, Graph Neural Network (GNN), deep learning
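
The slicing step named in the abstract can be illustrated with a minimal sketch. It assumes a hand-built toy dependence graph and a simple backward slice; the names `backward_slice` and `TOY_PDG` are hypothetical, and the slicing criteria and IR-level graph used in the paper are not given on this page, so this is not the authors' implementation.

```python
from collections import deque


def backward_slice(deps, criterion):
    """Return every node that `criterion` transitively depends on, itself included.

    `deps` maps a node id to the set of nodes it directly depends on.
    Data-dependence and control-dependence edges are mixed in one map,
    as they are in a program dependence graph.
    """
    seen = {criterion}
    queue = deque([criterion])
    while queue:
        node = queue.popleft()
        for pred in deps.get(node, ()):
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return seen


# Hypothetical toy graph (an assumption for illustration) for this program:
#   1: n = read()
#   2: buf = alloc(16)
#   3: if n < 16:
#   4:     log("ok")
#   5: copy(buf, src, n)
TOY_PDG = {
    3: {1},     # the condition uses n
    4: {3},     # log() is control-dependent on the branch
    5: {1, 2},  # copy() uses n and buf, and is not guarded by the branch
}

# Slice with respect to the copy() call on line 5.
slice_nodes = sorted(backward_slice(TOY_PDG, 5))
print(slice_nodes)
```

The slice for statement 5 keeps only the statements that can affect the copy and drops the unrelated branch, which is the kind of structured feature a later learning stage could consume.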

CLC Number: