Android malware family classification method based on code image integration

Journal of Computer Applications, 2022, Vol. 42, Issue (5): 1490-1499. DOI: 10.11772/j.issn.1001-9081.2021030486

Mo LI, Tianliang LU, Ziheng XIE

School of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China

Received: 2021-03-31 Revised: 2021-06-23 Accepted: 2021-06-25 Online: 2022-06-11 Published: 2022-05-10
Contact: Tianliang LU (lutianliang@ppsuc.edu.cn)
About the authors:
LI Mo, born in 1995 in Ganzhou, Jiangxi, M. S. candidate. His research interests include malware detection and machine learning.
LU Tianliang, born in 1985 in Baoding, Hebei, Ph. D., associate professor, CCF member. His main research interests include cyberspace security and malware detection.
XIE Ziheng, born in 1999 in Ningbo, Zhejiang. His research interests include network attack and defense, and malware detection.
Supported by:
2021 Open Project of Public Security Behavioral Science Lab (2020SYS06)

Abstract:

Code visualization techniques have spread rapidly in Android malware research since they were first proposed. To address the limited representation ability of code images converted from a single DEX (classes.dex) file, an Android malware family classification method based on code image integration is proposed. First, the DEX, XML (AndroidManifest.xml) and decompiled JAR (classes.jar) files in the Android application package are converted into three grayscale images, and the Bilinear interpolation algorithm is used to scale grayscale images of different sizes. The three grayscale images are then integrated into a single three-channel Red-Green-Blue (RGB) image for training and classification. For the classification model, STResNeSt (Soft Threshold (ST) Block + ResNeSt) is proposed by combining a soft threshold denoising block with the Split-Attention based ResNeSt. The model is robust to noise and pays more attention to the important features of code images. To handle the long-tail distribution of the training data, Class-Balanced Loss (CB Loss) is introduced on top of data augmentation, which offers a remedy for the over-fitting caused by sample imbalance. On the Drebin dataset, the accuracy of the integrated code image is 2.93 percentage points higher than that of the DEX grayscale image, the accuracy of STResNeSt is 1.1 percentage points higher than that of the Residual Neural Network (ResNet), and data augmentation combined with CB Loss improves the F1 score by up to 2.4 percentage points. Experimental results show that the proposed method reaches an average classification accuracy of 98.97% and can effectively classify Android malware families.

Key words: Android malware family, code image, transfer learning, Convolutional Neural Network (CNN), channel attention
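The integration step described in the abstract, merging three grayscale images into one RGB image, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the channel order (DEX to red, XML to green, JAR to blue) is an assumption, since the abstract does not state it, and `merge_to_rgb` is a hypothetical helper name. Images are plain nested lists so that the example needs only the standard library.

```python
from typing import List, Tuple

Gray = List[List[int]]                 # rows of 0-255 intensities
RGB = List[List[Tuple[int, int, int]]]


def merge_to_rgb(dex: Gray, xml: Gray, jar: Gray) -> RGB:
    """Stack three equally sized grayscale images into one RGB image.

    Assumed channel order: DEX -> R, XML -> G, JAR -> B.
    """
    shapes = {(len(img), len(img[0])) for img in (dex, xml, jar)}
    if len(shapes) != 1:
        raise ValueError("all three images must share the same height and width")
    # Pair up corresponding rows, then corresponding pixels within each row.
    return [list(zip(r_row, g_row, b_row))
            for r_row, g_row, b_row in zip(dex, xml, jar)]
```

The three inputs must already have the same size, which is why the scaling step comes before the merge in the described pipeline.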

CLC number: TP309.5

Cite this article: Mo LI, Tianliang LU, Ziheng XIE. Android malware family classification method based on code image integration[J]. Journal of Computer Applications, 2022, 42(5): 1490-1499.

Figures/Tables (14)

Fig. 1 Flow chart of malware classification

Tab. 1 Ratio of code image conversion

File size	Image width/pixel
<10 KB	64
[10 KB, 30 KB)	128
[30 KB, 60 KB)	214
[60 KB, 100 KB)	280
[100 KB, 200 KB)	400
[200 KB, 500 KB)	600
[500 KB, 1 MB)	864
[1 MB, 2 MB)	1 280
[2 MB, 3 MB)	1 600
[3 MB, 4 MB]	1 920
>4 MB	2 048
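The conversion rule in Tab. 1 can be sketched as a lookup from file size to image width, followed by reshaping the raw bytes into rows of that width. This is a sketch under stated assumptions rather than the authors' code: 1 KB is taken as 1024 bytes, the last row is zero-padded, and the names `image_width` and `bytes_to_gray` are hypothetical.

```python
import math
from typing import List

KB = 1024          # assumption: binary kilobytes
MB = 1024 * KB

# (exclusive upper bound on file size in bytes, image width in pixels), per Tab. 1
WIDTH_TABLE = [
    (10 * KB, 64), (30 * KB, 128), (60 * KB, 214), (100 * KB, 280),
    (200 * KB, 400), (500 * KB, 600), (1 * MB, 864), (2 * MB, 1280),
    (3 * MB, 1600),
]


def image_width(size: int) -> int:
    """Return the image width for a file of `size` bytes."""
    for upper, width in WIDTH_TABLE:
        if size < upper:
            return width
    # Tab. 1 closes the [3 MB, 4 MB] interval on the right.
    return 1920 if size <= 4 * MB else 2048


def bytes_to_gray(data: bytes) -> List[List[int]]:
    """Map each byte to one pixel (0-255) and wrap rows at the table width."""
    width = image_width(len(data))
    height = math.ceil(len(data) / width)
    padded = data + bytes(width * height - len(data))  # zero padding (assumption)
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]
```

For example, `bytes_to_gray(open(path, "rb").read())` would turn a file at a caller-supplied `path` into a grayscale image whose height depends on the file length.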

Fig. 2 Coordinates for interpolation

Fig. 3 Integrated code images of samples from different families

Fig. 4 Structures of ResNeSt Block and Split-Attention Block

Fig. 5 Structure of STResNeSt Block
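The soft threshold denoising module named in the abstract builds on the soft thresholding function, which shrinks every value toward zero by a threshold and zeroes out values whose magnitude is below it. The sketch below shows only that elementwise function with a fixed threshold `tau`; how STResNeSt chooses or learns the threshold is not stated on this page, so the fixed value is an assumption for illustration.

```python
import math
from typing import List, Sequence


def soft_threshold(values: Sequence[float], tau: float) -> List[float]:
    """Apply y = sign(x) * max(|x| - tau, 0) to every element."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    out = []
    for v in values:
        magnitude = abs(v) - tau
        # Values inside [-tau, tau] are treated as noise and set to zero.
        out.append(math.copysign(magnitude, v) if magnitude > 0 else 0.0)
    return out
```

With `tau = 1.0`, the input `[-2.0, -0.3, 0.0, 0.5, 3.0]` becomes `[-1.0, 0.0, 0.0, 0.0, 2.0]`: small activations vanish while large ones keep their sign.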

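Tab. 2 Parameter values of data augmentation

Transformation	Parameter value
Rotation about the center	[-0.1, 0.1]
Horizontal shift	[-0.1, 0.1]
Vertical shift	[-0.1, 0.1]
Zoom about the center	[-0.2, 0.2]
Flip mode	Horizontal flip
Fill mode	Interpolation fill at a fixed size of 224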

Tab. 3 Experimental dataset

Family	Total samples (original)	Total samples (augmented)	Training samples (original)	Training samples (augmented)	Test samples
FakeInstaller	925	1 185	740	1 000	185
DroidKungFu	666	1 133	533	1 000	133
Plankton	625	1 125	500	1 000	125
Opfake	613	1 123	490	1 000	123
GinMaster	339	1 068	271	1 000	68
BaseBridge	329	1 066	263	1 000	66
Iconosys	152	1 030	122	1 000	30
Kmin	147	1 029	118	1 000	29
FakeDoc	132	1 027	105	1 000	27
Geinimi	92	1 019	73	1 000	19

Fig. 6 Comparison of networks before and after transfer learning

Tab. 4 Performance comparison of different interpolation algorithms

Interpolation algorithm	Size	F1/%	Accuracy/%	Time/s
Nearest	224 × 224	97.76	98.23	1.64
Box	224 × 224	98.10	98.54	7.53
Lanczos	224 × 224	98.21	98.63	18.47
Bicubic	224 × 224	98.80	98.94	10.03
Bilinear	224 × 224	98.81	98.97	5.03
Bicubic (EfficientNet)	352 × 352	95.95	96.48	10.03
Bilinear (EfficientNet)	352 × 352	94.68	95.35	5.03
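Bilinear interpolation, the scaling algorithm with the best accuracy in Tab. 4, computes each output pixel as a weighted average of the four nearest source pixels. The following is a minimal pure-Python sketch of the idea, not the routine used in the experiments: it maps output corners exactly onto input corners, which is one of several coordinate conventions, and image libraries may use a different one and therefore give slightly different values.

```python
from typing import List


def bilinear_resize(img: List[List[float]], out_h: int, out_w: int) -> List[List[float]]:
    """Resize a 2-D grayscale image with bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Source row coordinate for output row i (corner-aligned mapping, an assumption).
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out
```

Calling `bilinear_resize(gray, 224, 224)` on a grayscale image `gray` brings it to the 224 × 224 input size listed in the table, whatever width Tab. 1 assigned to the original file.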

Tab. 5 Performance comparison of different sample balancing methods

Sample balancing method	Precision/%	Recall/%	F1/%
Original data	97.82	98.48	98.15
Data augmentation	98.43	97.91	98.17
CB Loss	98.11	98.57	98.34
Data augmentation + CB Loss	98.87	98.75	98.81
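CB Loss reweights each class by the inverse of its "effective number" of samples, E_n = (1 - beta^n) / (1 - beta), so that rare families contribute more to the loss. The sketch below computes such weights for the original training counts of Tab. 3. The value `beta = 0.999` and the normalization (weights sum to the number of classes) are common choices assumed here; the page does not give the settings used in the experiments, and `class_balanced_weights` is a hypothetical helper name.

```python
from typing import Dict

# Original training-set sizes per family, copied from Tab. 3
TRAIN_COUNTS: Dict[str, int] = {
    "FakeInstaller": 740, "DroidKungFu": 533, "Plankton": 500, "Opfake": 490,
    "GinMaster": 271, "BaseBridge": 263, "Iconosys": 122, "Kmin": 118,
    "FakeDoc": 105, "Geinimi": 73,
}


def class_balanced_weights(counts: Dict[str, int], beta: float = 0.999) -> Dict[str, float]:
    """Per-class weights proportional to (1 - beta) / (1 - beta ** n).

    Every count must be at least 1. beta = 0 gives uniform weights;
    beta close to 1 approaches inverse-frequency weighting.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    raw = {name: (1.0 - beta) / (1.0 - beta ** n) for name, n in counts.items()}
    scale = len(raw) / sum(raw.values())  # normalize so the weights sum to the class count
    return {name: w * scale for name, w in raw.items()}
```

The resulting weights would multiply the per-sample loss (for example cross-entropy) according to each sample's family label.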

Fig. 7 F1 scores of different families

Tab. 6 Performance comparison of different residual networks

Backbone network	Number of layers	Precision/%	Recall/%	F1/%	Accuracy/%
ResNet	50	97.38	96.81	97.09	97.87
ResNet	101	96.88	97.14	97.01	97.80
ResNeXt	50	97.65	97.71	97.68	97.98
ResNeXt	101	97.64	97.40	97.52	98.19
SENet	50	97.83	97.81	97.82	98.18
SENet	101	97.96	97.94	97.95	98.27
SKNet	50	98.50	97.76	98.13	98.39
SKNet	101	97.54	98.22	97.88	98.31
ResNeSt	50	98.14	98.66	98.40	98.61
ResNeSt	101	98.41	98.33	98.37	98.57
STResNeSt	50	98.87	98.75	98.81	98.97
STResNeSt	101	98.81	98.69	98.75	98.95

Tab. 7 Performance comparison of different code image generation methods

Code image generation method	Precision/%	Recall/%	F1/%	Accuracy/%
JAR	94.59	94.43	94.51	94.97
JAR (character filtering)	95.73	95.29	95.51	96.01
XML	95.96	96.38	96.17	96.92
DEX (grayscale image)	95.53	95.83	95.68	96.04
DEX	96.79	96.69	96.74	97.09
Integrated image	98.87	98.75	98.81	98.97
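The F1 column in Tab. 7 can be cross-checked against the precision and recall columns, since F1 is their harmonic mean, F1 = 2PR / (P + R). The short sketch below recomputes it for each row and lists any row whose reported value differs by more than rounding error. The helper names are hypothetical, and the tolerance of 0.005 simply reflects the two-decimal rounding of the table.

```python
from typing import Dict, List, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), all in percent, copied from Tab. 7
TABLE_7: Dict[str, Tuple[float, float, float]] = {
    "JAR": (94.59, 94.43, 94.51),
    "JAR (character filtering)": (95.73, 95.29, 95.51),
    "XML": (95.96, 96.38, 96.17),
    "DEX (grayscale image)": (95.53, 95.83, 95.68),
    "DEX": (96.79, 96.69, 96.74),
    "Integrated image": (98.87, 98.75, 98.81),
}


def inconsistent_rows(table: Dict[str, Tuple[float, float, float]],
                      tol: float = 0.005) -> List[str]:
    """Names of rows whose reported F1 deviates from 2PR / (P + R) by more than tol."""
    return [name for name, (p, r, f1) in table.items()
            if abs(f1_score(p, r) - f1) > tol]
```

Running `inconsistent_rows(TABLE_7)` returns an empty list, so every reported F1 value agrees with its precision and recall.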
