Classification of malicious code variants based on VGGNet

doi:10.11772/j.issn.1001-9081.2019050953

Journal of Computer Applications, 2020, Vol. 40, Issue (1): 162-167. DOI: 10.11772/j.issn.1001-9081.2019050953

• Cyber security •

WANG Bo^1,2, CAI Honghao^3, SU Yang^1,2

1. College of Cryptographic Engineering, Engineering College of PAP, Xi'an Shaanxi 710086, China;
2. Key Laboratory of Network and Information Security under the Armed Police Force (Engineering College of PAP), Xi'an Shaanxi 710086, China;
3. College of Information Engineering, Engineering College of PAP, Xi'an Shaanxi 710086, China

Received: 2019-06-06 Revised: 2019-08-14 Online: 2020-01-10 Published: 2020-01-17
Contact: SU Yang

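About the authors: WANG Bo (born 1996), male, from Huizhou, Guangdong, M.S. candidate; research interests: malicious code detection, deep learning. CAI Honghao (born 1996), male, from Hangzhou, Zhejiang, M.S. candidate; research interests: channel coding, deep learning. SU Yang (born 1975), male, from Xi'an, Shaanxi, professor, Ph.D.; research interests: network security, information countermeasures.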

Abstract: Because code reuse is common among samples of the same malicious code family, a malicious sample classification method that exploits code reuse features is proposed. First, the binary sequence of a file is split into values for the three RGB color channels, converting each malicious sample into a color image. Then, these images are used to train a malicious sample classification model based on the VGG convolutional neural network. Finally, during model training, the random dropout algorithm is used to address overfitting and gradient vanishing and to reduce the computational overhead of the network. Evaluated on 9342 samples from 25 families in the Malimg dataset, the method achieves an average classification accuracy of 96.16% and classifies malicious code samples effectively. Experimental results show that, compared with grayscale images, converting binary files into color images emphasizes image features more clearly, especially for files whose binary sequences contain repeated short data segments, and that a neural network trained on a set with more distinct features produces a better-performing classification model. Because preprocessing is simple and classification results are returned quickly, the method suits scenarios with strict real-time requirements, such as rapid classification of large-scale malicious samples.
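The preprocessing step described in the abstract, splitting a file's binary sequence into RGB channel values, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `bytes_to_rgb_image`, the fixed image width of 64 pixels, the byte-to-channel order (three consecutive bytes become one pixel's R, G, B values), and zero-padding of the final row are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def bytes_to_rgb_image(data: bytes, width: int = 64) -> np.ndarray:
    """Map a byte sequence to an (H, W, 3) uint8 array.

    Hypothetical helper: each group of three consecutive bytes is taken as
    one pixel's (R, G, B) values. The width and padding rule are assumptions.
    """
    if width <= 0:
        raise ValueError("width must be positive")

    buf = np.frombuffer(data, dtype=np.uint8)
    row_bytes = width * 3  # bytes consumed by one image row

    # Zero-pad the tail so the data fills a whole number of rows (assumed rule).
    pad = (-len(buf)) % row_bytes
    padded = np.pad(buf, (0, pad))

    height = len(padded) // row_bytes
    return padded.reshape(height, width, 3)


if __name__ == "__main__":
    # Seven bytes at width 2 need two rows of six bytes each.
    sample = bytes(range(7))
    image = bytes_to_rgb_image(sample, width=2)
    print(image.shape, image.dtype)
```

In practice the bytes would be read from a sample file (for example with `open(path, "rb").read()`), and the resulting array would be resized to the input size expected by the network before training.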

Key words: malicious code classification, data visualization, deep learning, dropout, Convolutional Neural Network (CNN)

CLC Number: TP309

WANG Bo, CAI Honghao, SU Yang. Classification of malicious code variants based on VGGNet[J]. Journal of Computer Applications, 2020, 40(1): 162-167.
