Journal of Computer Applications, 2022, Vol. 42, Issue (5): 1490-1499. DOI: 10.11772/j.issn.1001-9081.2021030486

• Cyber security •

Android malware family classification method based on code image integration

Mo LI, Tianliang LU, Ziheng XIE

  1. School of Information and Cyber Security, People’s Public Security University of China, Beijing 100038, China
  • Received: 2021-03-31 Revised: 2021-06-23 Accepted: 2021-06-25 Online: 2022-06-11 Published: 2022-05-10
  • Contact: Tianliang LU
  • About author: LI Mo, born in 1995 in Ganzhou, Jiangxi, M. S. candidate. His research interests include malware detection and machine learning.
    LU Tianliang, born in 1985 in Baoding, Hebei, Ph. D., associate professor, CCF member. His main research interests include cyber security and malware detection. lutianliang@ppsuc.edu.cn
    XIE Ziheng, born in 1999 in Ningbo, Zhejiang. His research interests include cyber attack and defense, and malware detection.
  • Supported by:
    2021 Open Project of Public Security Behavioral Science Lab (2020SYS06)

Abstract:

Code visualization (code-to-image) technology has spread rapidly in Android malware research since it was proposed. To address the insufficient representation ability of code images converted from a single DEX (classes.dex) file, an Android malware family classification method based on code image integration is proposed. First, the DEX, XML (AndroidManifest.xml) and decompiled JAR (classes.jar) files of the Android application package are each converted to a gray-scale image, and bilinear interpolation is used to rescale gray-scale images of different sizes. The three rescaled gray-scale images are then combined into a single three-channel Red-Green-Blue (RGB) image for training and classification. For the classification model, STResNeSt (Soft Threshold (ST) block + ResNeSt) is proposed by combining a soft-threshold denoising block with the Split-Attention based ResNeSt. The model has strong noise resistance and pays more attention to the important features of code images. To handle the long-tailed distribution of the training data, Class-Balanced Loss (CB Loss) is introduced on top of data augmentation, which offers a solution to the over-fitting caused by sample imbalance. On the Drebin dataset, the accuracy of the integrated code image is 2.93 percentage points higher than that of the DEX gray-scale image, the accuracy of STResNeSt is 1.1 percentage points higher than that of the Residual Neural Network (ResNet), and data augmentation combined with CB Loss improves the F1 score by up to 2.4 percentage points. Experimental results show that the average classification accuracy of the proposed method reaches 98.97%, so it can effectively classify Android malware families.
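The image integration step described in the abstract can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' implementation: the row width, the output size, the half-pixel sampling convention, and the DEX to R, XML to G, JAR to B channel order are all assumptions, since the abstract does not specify them, and every function name here is hypothetical.

```python
from typing import List, Tuple

Gray = List[List[float]]  # rows of pixel intensities in [0, 255]


def bytes_to_gray(data: bytes, width: int) -> Gray:
    """Lay raw bytes out row by row; the last row is zero-padded."""
    if width <= 0 or not data:
        raise ValueError("need non-empty data and a positive width")
    padded = data + bytes(-len(data) % width)
    return [[float(b) for b in padded[i:i + width]]
            for i in range(0, len(padded), width)]


def _source_coord(dst: int, src_len: int, dst_len: int) -> float:
    """Map an output pixel centre to an input coordinate (assumed convention)."""
    pos = (dst + 0.5) * src_len / dst_len - 0.5
    return min(max(pos, 0.0), src_len - 1.0)


def bilinear_resize(img: Gray, out_h: int, out_w: int) -> Gray:
    """Rescale a gray-scale image with bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out: Gray = []
    for y in range(out_h):
        fy = _source_coord(y, in_h, out_h)
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for x in range(out_w):
            fx = _source_coord(x, in_w, out_w)
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def merge_rgb(r: Gray, g: Gray, b: Gray) -> List[List[Tuple[int, int, int]]]:
    """Stack three equally sized gray-scale images as the R, G and B channels."""
    if not (len(r) == len(g) == len(b)
            and len(r[0]) == len(g[0]) == len(b[0])):
        raise ValueError("channels must have identical dimensions")
    return [[(round(pr), round(pg), round(pb))
             for pr, pg, pb in zip(row_r, row_g, row_b)]
            for row_r, row_g, row_b in zip(r, g, b)]


def synthesize(dex: bytes, xml: bytes, jar: bytes,
               width: int = 256, size: int = 224):
    """Convert three files to gray-scale, rescale them, and merge into RGB."""
    channels = [bilinear_resize(bytes_to_gray(d, width), size, size)
                for d in (dex, xml, jar)]
    return merge_rgb(*channels)
```

In practice the three byte strings would be read from the files extracted from an APK, and the resulting rows of `(R, G, B)` tuples would be handed to an image or tensor library before training. The default `width` and `size` values are placeholders only.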

Key words: Android malware family, code image, transfer learning, Convolutional Neural Network (CNN), channel attention
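Two of the building blocks named in the abstract, soft thresholding and class-balanced loss weighting, have commonly used textbook forms that are easy to state in code. The sketch below assumes those standard forms: soft thresholding as sign(x)·max(|x| − τ, 0), and class-balanced weights as (1 − β)/(1 − β^n) for a class with n samples, normalized to sum to the number of classes. The abstract does not confirm that the paper uses exactly these variants, and the function names and the default `beta` are illustrative assumptions.

```python
import math
from typing import List, Sequence


def soft_threshold(x: float, tau: float) -> float:
    """Shrink x toward zero by tau; inputs with |x| <= tau become zero."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    return math.copysign(max(abs(x) - tau, 0.0), x)


def class_balanced_weights(counts: Sequence[int],
                           beta: float = 0.999) -> List[float]:
    """Per-class loss weights based on the effective number of samples.

    Assumed form: weight is proportional to (1 - beta) / (1 - beta ** n),
    then rescaled so that the weights sum to the number of classes.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")
    if any(n < 1 for n in counts):
        raise ValueError("every class needs at least one sample")
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]


# A large family and a rare family: the rare one receives the larger weight.
weights = class_balanced_weights([1000, 10])
```

With `beta = 0` every class gets the same weight, and as `beta` approaches 1 the weights approach inverse class frequency, which is why this family of weights is often used on long-tailed data.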

CLC Number: