基于代码图像合成的Android恶意软件家族分类方法

Journal of Computer Applications, 2022, Vol. 42, Issue 5: 1490-1499. DOI: 10.11772/j.issn.1001-9081.2021030486

李默, 芦天亮, 谢子恒

中国人民公安大学信息网络安全学院，北京 100038

收稿日期:2021-03-31 修回日期:2021-06-23 接受日期:2021-06-25 发布日期:2022-06-11 出版日期:2022-05-10
通讯作者: 芦天亮
作者简介:李默（1995—），男，江西赣州人，硕士研究生，主要研究方向：恶意代码检测、机器学习
芦天亮（1985—），男，河北保定人，副教授，博士，CCF会员，主要研究方向：网络空间安全、恶意代码检测 lutianliang@ppsuc.edu.cn
谢子恒（1999—），男，浙江宁波人，主要研究方向：网络攻防、恶意代码检测。
基金资助:
2021年公共安全行为科学实验室开放课题(2020SYS06)

Android malware family classification method based on code image integration

Mo LI, Tianliang LU, Ziheng XIE

School of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China

Received: 2021-03-31 Revised: 2021-06-23 Accepted: 2021-06-25 Online: 2022-06-11 Published: 2022-05-10
Contact: Tianliang LU
About author: LI Mo, born in 1995, M. S. candidate. His research interests include malware detection and machine learning.
LU Tianliang, born in 1985, Ph. D., associate professor. His main research interests include cyber security and malware detection.
XIE Ziheng, born in 1999. His research interests include cyber attack and defense, and malware detection.
Supported by:
2021 Open Project of Public Security Behavioral Science Lab(2020SYS06)

摘要/Abstract

摘要：

代码图像化技术被提出后在Android恶意软件研究领域迅速普及。针对使用单个DEX文件转换而成的代码图像表征能力不足的问题，提出了一种基于代码图像合成的Android恶意软件家族分类方法。首先，将安装包中的DEX、XML与反编译生成的JAR文件进行灰度图像化处理，并使用Bilinear插值算法来放缩处理不同尺寸的灰度图像，然后将三张灰度图合成为一张三维RGB图像用于训练与分类。在分类模型上，将软阈值去噪模块与基于Split-Attention的ResNeSt相结合提出了STResNeSt。该模型具备较强的抗噪能力，更能关注代码图像的重要特征。针对训练过程中的数据长尾分布问题，在数据增强的基础上引入了类别平衡损失函数（CB Loss），从而为样本不平衡造成的过拟合现象提供了解决方案。在Drebin数据集上，合成代码图像的准确率领先DEX灰度图像2.93个百分点，STResNeSt与残差神经网络（ResNet）相比准确率提升了1.1个百分点，且数据增强结合CB Loss的方案将F1值最高提升了2.4个百分点。实验结果表明，所提方法的平均分类准确率达到了98.97%，能有效分类Android恶意软件家族。

关键词: Android恶意软件家族, 代码图像, 迁移学习, 卷积神经网络, 通道注意力

Abstract:

Code visualization technology spread rapidly through Android malware research after it was proposed. To address the insufficient representation ability of code images converted from a single DEX (classes.dex) file, an Android malware family classification method based on code image integration is proposed. First, the DEX, XML (AndroidManifest.xml) and decompiled JAR (classes.jar) files in the Android application package are converted into three gray-scale images, and the Bilinear interpolation algorithm is used to rescale gray-scale images of different sizes. The three gray-scale images are then integrated into one three-dimensional Red-Green-Blue (RGB) image for training and classification. For the classification model, the Soft Threshold (ST) Block + ResNeSt (STResNeSt) is proposed by combining a soft-threshold denoising block with the Split-Attention-based ResNeSt. The model has strong anti-noise ability and pays more attention to the important features of code images. To handle the long-tail distribution of the training data, Class-Balanced Loss (CB Loss) is introduced on top of data augmentation, which offers a feasible remedy for the over-fitting caused by sample imbalance. On the Drebin dataset, the accuracy of the integrated code image is 2.93 percentage points higher than that of the DEX gray-scale image, the accuracy of STResNeSt is 1.1 percentage points higher than that of the Residual Neural Network (ResNet), and data augmentation combined with CB Loss improves the F1 score by up to 2.4 percentage points. Experimental results show that the average classification accuracy of the proposed method reaches 98.97%, so the method can effectively classify Android malware families.

Key words: Android malware family, code image, transfer learning, Convolutional Neural Network (CNN), channel attention

CLC number: TP309.5

李默, 芦天亮, 谢子恒. 基于代码图像合成的Android恶意软件家族分类方法[J]. 计算机应用, 2022, 42(5): 1490-1499.

Mo LI, Tianliang LU, Ziheng XIE. Android malware family classification method based on code image integration[J]. Journal of Computer Applications, 2022, 42(5): 1490-1499.

图/表 14

图1 恶意软件分类流程

Fig. 1 Flow chart of malware classification

表1 代码图像转换比例

Tab. 1 Ratio of code image conversion

File size	Image width/pixel
<10 KB	64
[10 KB, 30 KB)	128
[30 KB, 60 KB)	214
[60 KB, 100 KB)	280
[100 KB, 200 KB)	400
[200 KB, 500 KB)	600
[500 KB, 1 MB)	864
[1 MB, 2 MB)	1 280
[2 MB, 3 MB)	1 600
[3 MB, 4 MB]	1 920
>4 MB	2 048
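
The mapping in Tab. 1 can be read as a simple lookup followed by a row-by-row byte layout. The following is a minimal sketch, not the authors' implementation: it assumes 1 KB = 1024 bytes, one byte per pixel, and zero padding of the last row, none of which is stated on this page. The names `image_width` and `bytes_to_gray_rows` are hypothetical.

```python
import math

KB = 1024          # assumption: binary kilobytes
MB = 1024 * KB

# (exclusive upper bound in bytes, image width in pixels), following Tab. 1.
WIDTH_TABLE = [
    (10 * KB, 64),
    (30 * KB, 128),
    (60 * KB, 214),
    (100 * KB, 280),
    (200 * KB, 400),
    (500 * KB, 600),
    (1 * MB, 864),
    (2 * MB, 1280),
    (3 * MB, 1600),
]


def image_width(size_in_bytes):
    """Return the image width that Tab. 1 assigns to a file of this size."""
    for upper_bound, width in WIDTH_TABLE:
        if size_in_bytes < upper_bound:
            return width
    # Tab. 1 closes the [3 MB, 4 MB] interval on the right.
    return 1920 if size_in_bytes <= 4 * MB else 2048


def bytes_to_gray_rows(data, width):
    """Lay the bytes out row by row; each byte becomes one pixel in 0..255."""
    height = math.ceil(len(data) / width)
    padded = bytes(data) + bytes(height * width - len(data))  # zero-pad last row
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]
```

For a real file, `data` would be the raw contents of, for example, classes.dex, and `width` would come from `image_width(len(data))`.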

图2 插值坐标

Fig. 2 Coordinates for interpolations
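
Since the figure itself is not reproduced here, a small sketch may help show what Bilinear interpolation computes: each output pixel is a distance-weighted average of its four nearest source pixels. This is an illustration in pure Python under one assumed coordinate convention (corner-aligned mapping); image libraries may use a different convention, and the paper's exact implementation is not given on this page. The function names are hypothetical.

```python
def bilinear_sample(image, x, y):
    """Interpolate a gray-scale image (list of rows) at real coordinates (x, y)."""
    height, width = len(image), len(image[0])
    x0 = min(int(x), width - 1)
    y0 = min(int(y), height - 1)
    x1 = min(x0 + 1, width - 1)    # clamp neighbours at the image border
    y1 = min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = image[y0][x0] * (1 - dx) + image[y0][x1] * dx
    bottom = image[y1][x0] * (1 - dx) + image[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def resize_bilinear(image, new_width, new_height):
    """Rescale an image so that its corner pixels map onto the new corners."""
    height, width = len(image), len(image[0])
    sx = (width - 1) / (new_width - 1) if new_width > 1 else 0.0
    sy = (height - 1) / (new_height - 1) if new_height > 1 else 0.0
    return [
        [round(bilinear_sample(image, c * sx, r * sy)) for c in range(new_width)]
        for r in range(new_height)
    ]
```

In the method described in the abstract, a step of this kind brings the three gray-scale images to one common size before they are merged.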

图3 不同家族样本的合成代码图像

Fig. 3 Integrated code images of samples from different families
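
The integration step described in the abstract, merging three gray-scale images into one RGB image, can be sketched as a per-pixel stacking of three equally sized planes. The channel assignment below (DEX to red, XML to green, JAR to blue) is an assumption made for illustration; this page does not state which file feeds which channel. `merge_to_rgb` is a hypothetical name.

```python
def merge_to_rgb(dex_gray, xml_gray, jar_gray):
    """Stack three equally sized gray-scale images into one RGB image.

    Each input is a list of rows of 0..255 values; the result is a list of
    rows of (r, g, b) tuples. Channel order is an assumption, see above.
    """
    planes = (dex_gray, xml_gray, jar_gray)
    shapes = {(len(img), len(img[0])) for img in planes}
    if len(shapes) != 1:
        raise ValueError("all three images must share one size; rescale first")
    return [
        list(zip(dex_row, xml_row, jar_row))
        for dex_row, xml_row, jar_row in zip(dex_gray, xml_gray, jar_gray)
    ]
```

The size check reflects why rescaling comes first: the three source files generally differ in length, so their gray-scale images differ in size until they are interpolated to a common resolution.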

图4 ResNeSt Block与Split-Attention Block结构

Fig. 4 Structures of ResNeSt Block and Split-Attention Block

图5 STResNeSt Block结构

Fig. 5 Structure of STResNeSt Block
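
The denoising idea behind the soft threshold module can be illustrated with the standard soft-thresholding function, which zeroes values whose magnitude is at most a threshold and shrinks the rest toward zero. The sketch below uses a fixed, hand-picked threshold `tau`; how STResNeSt obtains its thresholds inside the block is not described on this page, so treat the fixed value as an assumption.

```python
def soft_threshold(x, tau):
    """Soft thresholding: sign(x) * max(|x| - tau, 0)."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def soft_threshold_all(values, tau):
    """Apply soft thresholding element-wise to a list of feature values."""
    return [soft_threshold(v, tau) for v in values]


features = [3.0, -3.0, 0.5, -0.2]
print(soft_threshold_all(features, tau=1.0))
```

Small responses, which are more likely to be noise, are removed, while large responses keep their sign and most of their magnitude.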

表2 数据增强参数值

Tab. 2 Parameter values of data augmentation

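Transformation	Parameter value
Center rotation	[-0.1, 0.1]
Horizontal shift	[-0.1, 0.1]
Vertical shift	[-0.1, 0.1]
Center zoom	[-0.2, 0.2]
Flip mode	Horizontal flip
Fill mode	Interpolation fill to a fixed size of 224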

表3 实验数据集

Tab. 3 Experimental dataset

Family	Total samples (original)	Total samples (augmented)	Training samples (original)	Training samples (augmented)	Test samples
FakeInstaller	925	1 185	740	1 000	185
DroidKungFu	666	1 133	533	1 000	133
Plankton	625	1 125	500	1 000	125
Opfake	613	1 123	490	1 000	123
GinMaster	339	1 068	271	1 000	68
BaseBridge	329	1 066	263	1 000	66
Iconosys	152	1 030	122	1 000	30
Kmin	147	1 029	118	1 000	29
FakeDoc	132	1 027	105	1 000	27
Geinimi	92	1 019	73	1 000	19
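
To make the long-tail problem in Tab. 3 concrete, the sketch below computes class-balanced weights from the effective number of samples, E_n = (1 - beta^n) / (1 - beta), where each class weight is proportional to 1 / E_n. Two things here are assumptions: the value beta = 0.999 is illustrative, and the weights are computed from the original training counts. If the augmented counts (1 000 for every family) were used instead, all weights would be equal.

```python
# Original training-set sizes per family, taken from Tab. 3.
TRAIN_COUNTS = {
    "FakeInstaller": 740, "DroidKungFu": 533, "Plankton": 500, "Opfake": 490,
    "GinMaster": 271, "BaseBridge": 263, "Iconosys": 122, "Kmin": 118,
    "FakeDoc": 105, "Geinimi": 73,
}


def class_balanced_weights(counts, beta=0.999):
    """Return per-class weights proportional to (1 - beta) / (1 - beta ** n).

    The weights are normalised so that they sum to the number of classes.
    beta = 0.999 is an illustrative choice, not a value from this page.
    """
    raw = {name: (1.0 - beta) / (1.0 - beta ** n) for name, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {name: weight * scale for name, weight in raw.items()}


weights = class_balanced_weights(TRAIN_COUNTS)
for family, weight in weights.items():
    print(f"{family}: {weight:.3f}")
```

Rare families such as Geinimi receive larger weights than frequent ones such as FakeInstaller, so their errors count more in the loss.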

图6 迁移学习前后的网络对比

Fig. 6 Comparison of networks before and after transfer learning

表4 不同插值算法的性能对比

Tab. 4 Performance comparison of different interpolation algorithms

Interpolation algorithm	Size	F1/%	Accuracy/%	Time/s
Nearest	$224 × 224$	97.76	98.23	1.64
Box	$224 × 224$	98.10	98.54	7.53
Lanczos	$224 × 224$	98.21	98.63	18.47
Bicubic	$224 × 224$	98.80	98.94	10.03
Bilinear	$224 × 224$	98.81	98.97	5.03
Bicubic (EfficientNet)	$352 × 352$	95.95	96.48	10.03
Bilinear (EfficientNet)	$352 × 352$	94.68	95.35	5.03

表5 不同样本均衡方法的性能对比

Tab. 5 Performance comparison of different sample balancing methods

Sample balancing method	Precision/%	Recall/%	F1/%
Original data	97.82	98.48	98.15
Data augmentation	98.43	97.91	98.17
CB Loss	98.11	98.57	98.34
Data augmentation + CB Loss	98.87	98.75	98.81
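
As a consistency check on Tab. 5, the F1 column can be recomputed as the harmonic mean of the precision and recall columns, F1 = 2PR / (P + R). The sketch below does this for each row; the recomputed values agree with the reported ones to two decimal places. Whether the paper averages per-family F1 scores or derives F1 from averaged precision and recall is not stated on this page, so this check only shows that the columns are mutually consistent.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) for each row of Tab. 5.
TABLE_5 = {
    "Original data": (97.82, 98.48, 98.15),
    "Data augmentation": (98.43, 97.91, 98.17),
    "CB Loss": (98.11, 98.57, 98.34),
    "Data augmentation + CB Loss": (98.87, 98.75, 98.81),
}

for method, (precision, recall, reported) in TABLE_5.items():
    computed = f1_score(precision, recall)
    print(f"{method}: computed {computed:.2f}, reported {reported:.2f}")
```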

图7 不同家族的F1值

Fig. 7 F1 scores of different families

表6 不同残差网络的性能对比

Tab. 6 Performance comparison of different residual networks

Base network	Number of layers	Precision/%	Recall/%	F1/%	Accuracy/%
ResNet	50	97.38	96.81	97.09	97.87
ResNet	101	96.88	97.14	97.01	97.80
ResNeXt	50	97.65	97.71	97.68	97.98
ResNeXt	101	97.64	97.40	97.52	98.19
SENet	50	97.83	97.81	97.82	98.18
SENet	101	97.96	97.94	97.95	98.27
SKNet	50	98.50	97.76	98.13	98.39
SKNet	101	97.54	98.22	97.88	98.31
ResNeSt	50	98.14	98.66	98.40	98.61
ResNeSt	101	98.41	98.33	98.37	98.57
STResNeSt	50	98.87	98.75	98.81	98.97
STResNeSt	101	98.81	98.69	98.75	98.95

表7 不同代码图像生成方法的性能对比

Tab. 7 Performance comparison of different code image generation methods

Code image generation method	Precision/%	Recall/%	F1/%	Accuracy/%
JAR	94.59	94.43	94.51	94.97
JAR (character filtering)	95.73	95.29	95.51	96.01
XML	95.96	96.38	96.17	96.92
DEX (gray-scale image)	95.53	95.83	95.68	96.04
DEX	96.79	96.69	96.74	97.09
Integrated image	98.87	98.75	98.81	98.97
