Static code defect detection method based on deep semantic fusion

doi:10.11772/j.issn.1001-9081.2021081548

Abstract

As computer software grows in scale and complexity, code defects in software have become a serious threat to public safety. To address the poor extensibility of static analysis tools, as well as the coarse detection granularity and unsatisfactory detection performance of existing methods, a static code defect detection method based on program slicing and semantic feature fusion was proposed. Firstly, key points in the source code were analyzed through data flow and control flow, and a program slicing method based on Interprocedural Finite Distributive Subset (IFDS) was adopted to obtain code snippets composed of multiple lines of statements related to code defects. Then, semantically related vector representations of the code snippets were obtained by word embedding, and a suitable snippet length was selected without sacrificing accuracy. Finally, a Text Convolutional Neural Network (TextCNN) and a Bidirectional Gated Recurrent Unit (BiGRU) were used to extract the local key features and the context sequence features of each code snippet respectively, and the proposed method was used to detect slice-level code defects. Experimental results show that the proposed method detects different types of code defects effectively and performs significantly better than the static analysis tool Flawfinder. While keeping the detection fine-grained, the IFDS slicing method further improves the F₁ score and accuracy, which reach 89.64% and 92.08% respectively. Compared with existing methods based on program slicing, when the key points are Application Programming Interface (API) calls or variables, the proposed method reaches F₁ scores of 89.69% and 89.74% and accuracies of 92.15% and 91.98% respectively, all of which are higher. The proposed method therefore achieves better overall detection performance without significantly increasing time complexity.

Key words: defect detection, program slicing, semantic analysis, deep learning, feature fusion

CLC Number:

TP393.08

Jingyun CHENG, Buhong WANG, Peng LUO. Static code defect detection method based on deep semantic fusion[J]. Journal of Computer Applications, 2022, 42(10): 3170-3176.

Variable	Backward slice	Forward slice
argc@main	{14}	{6, 8, 9, 10, 11, 12, 17, 19, 20, 21, 22}
argv@main	{14}	{9, 19}
buf@test	{6, 8, 14, 17, 20}	{8}
str@test	{6, 14, 17, 19, 20}	{9}
userstr@main	{14, 17, 19}	{19}
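
The backward and forward sets in the table above are sets of line numbers related to a key point. As a rough illustration of how such sets can be derived, the sketch below computes slices by plain graph reachability over a dependence graph. This is a simplified stand-in, not the IFDS-based slicing the paper uses, and the `DEPENDS` graph and the function names are hypothetical rather than taken from the paper's example program.

```python
from collections import deque


def slice_from(graph, start):
    """Return every node reachable from `start` (including `start`)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Flip every edge so that reachability runs against the dependences."""
    rev = {}
    for src, dsts in graph.items():
        for dst in dsts:
            rev.setdefault(dst, set()).add(src)
    return rev


# Hypothetical dependence edges: line -> lines that depend on it.
DEPENDS = {1: {2, 3}, 2: {4}, 3: {4}, 4: {5}, 6: {5}}

# Forward slice: lines affected by line 2.
forward = slice_from(DEPENDS, 2)
# Backward slice: lines that line 4 depends on.
backward = slice_from(reverse(DEPENDS), 4)
```

A forward slice follows the dependence edges, while a backward slice follows them in reverse; an interprocedural method such as IFDS additionally has to match calls with returns, which this sketch ignores.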

Actual	Predicted vulnerable	Predicted non-vulnerable
Vulnerable	TP	FN
Non-vulnerable	FP	TN
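
The accuracy and F₁ score reported in the abstract can be computed from the four cells of this confusion matrix using the standard definitions. The sketch below is a minimal illustration; the function name and the example counts are hypothetical and are not results from the paper.

```python
def slice_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
metrics = slice_metrics(tp=80, fn=20, fp=10, tn=90)
```

F₁ is the harmonic mean of precision and recall, which is equivalent to 2TP / (2TP + FP + FN), so it ignores the TN cell that accuracy rewards.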

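Parameter	Value	Parameter	Value
Number of filters (N)	128	Epochs	20
Convolution window sizes (m)	1, 3, 5	Activation function	ReLU
GRU units (u)	50	Pooling method	MaxPooling1D
Fully connected layer units	484	Optimizer	Adamax
Dropout	0.5	Loss function	categorical_crossentropy
Batch size	256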
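
One way to read the value 484 in the table is as the width of the fused feature vector. The sketch below checks that arithmetic under two assumptions that the table does not state: each convolution filter is max-pooled down to a single value, and the BiGRU contributes one vector of size u per direction, concatenated with the TextCNN features. The function name is hypothetical.

```python
NUM_FILTERS = 128          # N in the table
WINDOW_SIZES = (1, 3, 5)   # m in the table
GRU_UNITS = 50             # u in the table


def fused_width(num_filters, window_sizes, gru_units):
    """Width of the concatenated TextCNN + BiGRU feature vector.

    Assumes one pooled value per filter per window size, and one
    output vector of size `gru_units` per GRU direction.
    """
    textcnn_width = num_filters * len(window_sizes)
    bigru_width = 2 * gru_units
    return textcnn_width + bigru_width


width = fused_width(NUM_FILTERS, WINDOW_SIZES, GRU_UNITS)
```

Under these assumptions the TextCNN branch contributes 3 × 128 = 384 features and the BiGRU branch 2 × 50 = 100, which sums to the 484 units listed for the fully connected layer.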