Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (1): 162-167. DOI: 10.11772/j.issn.1001-9081.2019050953

• Cyber security •

Classification of malicious code variants based on VGGNet

WANG Bo1,2, CAI Honghao3, SU Yang1,2   

  1. College of Cryptographic Engineering, Engineering College of PAP, Xi'an Shaanxi 710086, China;
  2. Key Laboratory of Network and Information Security under the Armed Police Force (Engineering College of PAP), Xi'an Shaanxi 710086, China;
  3. College of Information Engineering, Engineering College of PAP, Xi'an Shaanxi 710086, China
  • Received: 2019-06-06 Revised: 2019-08-14 Online: 2020-01-10 Published: 2020-01-17
  • Contact: SU Yang


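  • About the authors: WANG Bo (born 1996), male, from Huizhou, Guangdong, M.S. candidate; research interests: malicious code detection, deep learning. CAI Honghao (born 1996), male, from Hangzhou, Zhejiang, M.S. candidate; research interests: channel coding, deep learning. SU Yang (born 1975), male, from Xi'an, Shaanxi, professor, Ph.D.; research interests: network security, information countermeasures.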

Abstract: Code reuse is common among samples of the same malicious code family. Exploiting this, a malicious sample classification method based on code reuse features is proposed. First, the binary sequence of a file is split into values for the three RGB color channels, converting each malicious sample into a color image. Then, these images are used to train a malicious sample classification model based on the VGG convolutional neural network. Finally, during model training, the random dropout algorithm is applied to mitigate overfitting and gradient vanishing and to reduce computational overhead. Evaluated on 9342 samples from 25 families in the Malimg dataset, the method achieves an average classification accuracy of 96.16% and classifies malicious code samples effectively. Experimental results show that, compared with grayscale images, converting binary files into color images emphasizes image features more strongly, especially for files whose binary sequences contain repeated short data segments, and that, given a training set with more pronounced features, a neural network can produce a classification model with better performance. Because preprocessing is simple and classification results are returned quickly, the method suits scenarios with high real-time requirements, such as the rapid classification of large-scale malicious samples.
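
The first step described in the abstract, splitting a file's byte sequence into RGB channel values, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the image width, the padding rule, or how bytes map to channels, so the function name `bytes_to_rgb_rows`, the `width` and `pad` parameters, and the "three consecutive bytes form one (R, G, B) pixel" mapping are all assumptions made here for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, ...]


def bytes_to_rgb_rows(data: bytes, width: int = 4, pad: int = 0) -> List[List[Pixel]]:
    """Lay out a byte sequence as rows of (R, G, B) pixels.

    Assumptions (not taken from the paper): every three consecutive bytes
    form one pixel, rows have a fixed `width`, and any shortfall at the end
    is filled with the `pad` value.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if not 0 <= pad <= 255:
        raise ValueError("pad must be a byte value in 0..255")

    # Pad the byte sequence so its length is a multiple of 3 (one pixel = 3 bytes).
    remainder = len(data) % 3
    if remainder:
        data += bytes([pad]) * (3 - remainder)

    # Group consecutive bytes into (R, G, B) triples.
    pixels = [tuple(data[i:i + 3]) for i in range(0, len(data), 3)]

    # Pad the final row with filler pixels so that every row has `width` pixels.
    missing = (-len(pixels)) % width
    pixels.extend([(pad, pad, pad)] * missing)

    # Cut the flat pixel list into rows.
    return [pixels[r:r + width] for r in range(0, len(pixels), width)]


if __name__ == "__main__":
    # Ten bytes become four pixels; the last pixel is padded with two zeros.
    for row in bytes_to_rgb_rows(bytes(range(10)), width=2):
        print(row)
```

Under this mapping, a short byte pattern that repeats throughout a file produces a repeating color pattern in the image, which is one plausible reading of why the abstract reports that color images emphasize features of files with repetitive short data segments. The resulting rows could then be turned into an image array and resized to the input size expected by a VGG-style network.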

Key words: malicious code classification, data visualization, deep learning, dropout, Convolutional Neural Network (CNN)

