Android malware family classification method based on code image integration

Journal of Computer Applications, 2022, Vol. 42, Issue 5: 1490-1499. DOI: 10.11772/j.issn.1001-9081.2021030486

• Cyber security •

Mo LI, Tianliang LU, Ziheng XIE

School of Information and Cyber Security, People's Public Security University of China, Beijing 100038, China

Received: 2021-03-31 Revised: 2021-06-23 Accepted: 2021-06-25 Online: 2022-06-11 Published: 2022-05-10
Contact: Tianliang LU (lutianliang@ppsuc.edu.cn)
About author: LI Mo, born in 1995, from Ganzhou, Jiangxi, M. S. candidate. His research interests include malware detection and machine learning.
LU Tianliang, born in 1985, from Baoding, Hebei, Ph. D., associate professor, CCF member. His main research interests include cyber security and malware detection.
XIE Ziheng, born in 1999, from Ningbo, Zhejiang. His research interests include cyberattack and defense, and malware detection.
Supported by:
2021 Open Project of Public Security Behavioral Science Lab (2020SYS06)

Abstract:

Code visualization has spread rapidly through Android malware research since it was first proposed. To address the limited representational power of code images converted from a single DEX (classes.dex) file, a new Android malware family classification method based on code image integration was proposed. Firstly, the DEX, XML (AndroidManifest.xml) and decompiled JAR (classes.jar) files in the Android application package were converted into three gray-scale images, and the bilinear interpolation algorithm was used to rescale gray-scale images of different sizes. Then, the three gray-scale images were integrated into a single three-channel Red-Green-Blue (RGB) image for training and classification. For the classification model, Soft Threshold (ST) Block+ResNeSt (STResNeSt) was proposed by combining a soft-threshold denoising block with the Split-Attention-based ResNeSt. The proposed model is robust to noise and pays more attention to the important features of the code image. To handle the long-tailed distribution of the training data, Class-Balanced Loss (CB Loss) was introduced on top of data augmentation, providing a feasible remedy for the over-fitting caused by sample imbalance. On the Drebin dataset, the accuracy of the integrated code image is 2.93 percentage points higher than that of the DEX gray-scale image, the accuracy of STResNeSt is 1.1 percentage points higher than that of the Residual Neural Network (ResNet), and data augmentation combined with CB Loss improves the F1 score by up to 2.4 percentage points. Experimental results show that the average classification accuracy of the proposed method reaches 98.97%, so it can effectively classify Android malware families.

Key words: Android malware family, code image, transfer learning, Convolutional Neural Network (CNN), channel attention

CLC Number: TP309.5

Mo LI, Tianliang LU, Ziheng XIE. Android malware family classification method based on code image integration[J]. Journal of Computer Applications, 2022, 42(5): 1490-1499.

Figures/Tables 14

Fig. 1 Flow chart of malware classification

Tab. 1 Ratio of code image conversion

File size	Image width/pixels
<10 KB	64
[10 KB, 30 KB)	128
[30 KB, 60 KB)	214
[60 KB, 100 KB)	280
[100 KB, 200 KB)	400
[200 KB, 500 KB)	600
[500 KB, 1 MB)	864
[1 MB, 2 MB)	1 280
[2 MB, 3 MB)	1 600
[3 MB, 4 MB]	1 920
>4 MB	2 048
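The mapping in Tab. 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes binary units (1 KB = 1024 bytes), that each byte of the file becomes one gray pixel in the range 0 to 255, and that the last row is zero-padded. The names `width_for_size` and `bytes_to_gray` are hypothetical.

```python
KB = 1024          # assumption: Tab. 1 uses binary units
MB = 1024 * KB

# (exclusive upper bound in bytes, image width in pixels), following Tab. 1
WIDTH_TABLE = [
    (10 * KB, 64),
    (30 * KB, 128),
    (60 * KB, 214),
    (100 * KB, 280),
    (200 * KB, 400),
    (500 * KB, 600),
    (1 * MB, 864),
    (2 * MB, 1280),
    (3 * MB, 1600),
]


def width_for_size(n_bytes):
    """Return the image width that Tab. 1 assigns to a file of n_bytes."""
    for upper, width in WIDTH_TABLE:
        if n_bytes < upper:
            return width
    # Tab. 1 closes the last bounded interval at 4 MB inclusive.
    return 1920 if n_bytes <= 4 * MB else 2048


def bytes_to_gray(data):
    """Lay raw bytes out row by row; each byte is one pixel (0..255)."""
    width = width_for_size(len(data))
    height = -(-len(data) // width)                     # ceiling division
    padded = data + bytes(width * height - len(data))   # zero-pad (assumption)
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]
```

For example, a 100-byte input falls in the first row of Tab. 1, so it becomes a 2 × 64 image whose final 28 pixels are padding.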

Fig. 2 Coordinates for interpolations
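As a rough illustration of the bilinear interpolation used for rescaling, the sketch below resizes a gray-scale image stored as a list of rows. It assumes a pixel-centre coordinate mapping with clamping at the borders; library implementations may use a different convention or add anti-aliasing when shrinking, so their output can differ. The function name `bilinear_resize` is hypothetical.

```python
def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D list of gray values with bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Map the output pixel centre back into source coordinates.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out
```

The result holds floating-point values; rounding them back to integers in 0 to 255 is left to the caller.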

Fig. 3 Integrated code images of samples from different families
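The integration step can be sketched as stacking three equally sized gray-scale images into one RGB image. The channel order used here (DEX as red, XML as green, JAR as blue) simply follows the order in which the abstract lists the files and is an assumption; the function name `merge_rgb` is hypothetical.

```python
def merge_rgb(dex, xml, jar):
    """Stack three same-sized gray images into one image of (R, G, B) tuples.

    Assumed channel order: DEX -> R, XML -> G, JAR -> B.
    """
    h, w = len(dex), len(dex[0])
    for channel in (dex, xml, jar):
        if len(channel) != h or any(len(row) != w for row in channel):
            raise ValueError("all three gray images must share one size")
    return [
        [(dex[y][x], xml[y][x], jar[y][x]) for x in range(w)]
        for y in range(h)
    ]
```

Because the three source files differ in size, each gray-scale image has to be rescaled to the same height and width before this step.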

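Tab. 2 Parameter values of data augmentation

Transformation	Parameter value
Center rotation	[-0.1, 0.1]
Horizontal shift	[-0.1, 0.1]
Vertical shift	[-0.1, 0.1]
Center zoom	[-0.2, 0.2]
Flip mode	Horizontal flip
Fill mode	Interpolation fill at fixed size 224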

Tab. 3 Experimental dataset

Family	Total samples (original)	Total samples (augmented)	Training samples (original)	Training samples (augmented)	Test samples
FakeInstaller	925	1 185	740	1 000	185
DroidKungFu	666	1 133	533	1 000	133
Plankton	625	1 125	500	1 000	125
Opfake	613	1 123	490	1 000	123
GinMaster	339	1 068	271	1 000	68
BaseBridge	329	1 066	263	1 000	66
Iconosys	152	1 030	122	1 000	30
Kmin	147	1 029	118	1 000	29
FakeDoc	132	1 027	105	1 000	27
Geinimi	92	1 019	73	1 000	19
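To show how a class-balanced loss reweights a long-tailed dataset like the one in Tab. 3, the sketch below computes per-class weights from the effective number of samples, (1 - β^n) / (1 - β), and normalises them to sum to the number of classes. It assumes the original (pre-augmentation) training counts are the ones that matter, since the augmented counts are all 1 000 and would give uniform weights. The value β = 0.999 is only an illustrative choice, and `class_balanced_weights` is a hypothetical name.

```python
# Original training-set counts per family, copied from Tab. 3.
TRAIN_COUNTS = {
    "FakeInstaller": 740, "DroidKungFu": 533, "Plankton": 500,
    "Opfake": 490, "GinMaster": 271, "BaseBridge": 263,
    "Iconosys": 122, "Kmin": 118, "FakeDoc": 105, "Geinimi": 73,
}


def class_balanced_weights(counts, beta=0.999):
    """Weight each class by the inverse of its effective number of samples.

    beta is a hyperparameter in [0, 1); 0.999 is an assumption for this demo.
    """
    raw = {name: (1.0 - beta) / (1.0 - beta ** n) for name, n in counts.items()}
    scale = len(raw) / sum(raw.values())   # weights sum to the class count
    return {name: w * scale for name, w in raw.items()}


weights = class_balanced_weights(TRAIN_COUNTS)
```

With these counts the rarest family (Geinimi) receives the largest weight and the most common one (FakeInstaller) the smallest; setting β = 0 reduces every weight to 1, which is the unweighted loss.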

Fig. 6 Comparison of networks before and after transfer learning

Tab. 4 Performance comparison of different interpolation algorithms

Interpolation algorithm	Size	F1/%	Accuracy/%	Time/s
Nearest	224 × 224	97.76	98.23	1.64
Box	224 × 224	98.10	98.54	7.53
Lanczos	224 × 224	98.21	98.63	18.47
Bicubic	224 × 224	98.80	98.94	10.03
Bilinear	224 × 224	98.81	98.97	5.03
Bicubic (EfficientNet)	352 × 352	95.95	96.48	10.03
Bilinear (EfficientNet)	352 × 352	94.68	95.35	5.03

Tab. 5 Performance comparison of different sample balancing methods

Sample balancing method	Precision/%	Recall/%	F1/%
Original data	97.82	98.48	98.15
Data augmentation	98.43	97.91	98.17
CB Loss	98.11	98.57	98.34
Data augmentation + CB Loss	98.87	98.75	98.81
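The F1 column of Tab. 5 is consistent with the harmonic mean of the reported precision and recall, F1 = 2PR / (P + R). The short check below recomputes it for each row; how the per-family scores were averaged is not stated here, so this only confirms that the published numbers agree with one another. The names `f1_score` and `ROWS` are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision/%, recall/%, reported F1/%) for each row of Tab. 5
ROWS = {
    "Original data": (97.82, 98.48, 98.15),
    "Data augmentation": (98.43, 97.91, 98.17),
    "CB Loss": (98.11, 98.57, 98.34),
    "Data augmentation + CB Loss": (98.87, 98.75, 98.81),
}

for name, (p, r, reported) in ROWS.items():
    print(f"{name}: computed {f1_score(p, r):.2f}, reported {reported:.2f}")
```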

Fig. 7 F1 scores of different families

Tab. 6 Performance comparison of different residual networks

Backbone network	Number of layers	Precision/%	Recall/%	F1/%	Accuracy/%
ResNet	50	97.38	96.81	97.09	97.87
ResNet	101	96.88	97.14	97.01	97.80
ResNeXt	50	97.65	97.71	97.68	97.98
ResNeXt	101	97.64	97.40	97.52	98.19
SENet	50	97.83	97.81	97.82	98.18
SENet	101	97.96	97.94	97.95	98.27
SKNet	50	98.50	97.76	98.13	98.39
SKNet	101	97.54	98.22	97.88	98.31
ResNeSt	50	98.14	98.66	98.40	98.61
ResNeSt	101	98.41	98.33	98.37	98.57
STResNeSt	50	98.87	98.75	98.81	98.97
STResNeSt	101	98.81	98.69	98.75	98.95
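The "ST" in STResNeSt refers to a soft-threshold denoising block. The core operation, soft thresholding, shrinks every value toward zero by a threshold τ and zeroes anything whose magnitude is at most τ. The sketch below shows only that element-wise function; how the block chooses τ is not described on this page, so `tau` is a plain hypothetical argument here.

```python
def soft_threshold(x, tau):
    """Return sign(x) * max(|x| - tau, 0) for a non-negative threshold tau."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def soft_threshold_all(values, tau):
    """Apply soft thresholding to every element of a feature vector."""
    return [soft_threshold(v, tau) for v in values]
```

Small activations, which are more likely to be noise, are removed entirely, while large ones keep their sign and lose only τ of their magnitude.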

Tab. 7 Performance comparison of different code image generation methods

Code image generation method	Precision/%	Recall/%	F1/%	Accuracy/%
JAR	94.59	94.43	94.51	94.97
JAR (character filtering)	95.73	95.29	95.51	96.01
XML	95.96	96.38	96.17	96.92
DEX (gray-scale image)	95.53	95.83	95.68	96.04
DEX	96.79	96.69	96.74	97.09
Integrated image	98.87	98.75	98.81	98.97
