Journal of Computer Applications, 2022, Vol. 42, Issue (5): 1490-1499. DOI: 10.11772/j.issn.1001-9081.2021030486
Mo LI, Tianliang LU, Ziheng XIE
Received:
2021-03-31
Revised:
2021-06-23
Accepted:
2021-06-25
Online:
2022-06-11
Published:
2022-05-10
Contact:
Tianliang LU
About author:
LI Mo, born in 1995 in Ganzhou, Jiangxi, M. S. candidate. His research interests include malware detection and machine learning.
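Abstract:
Code visualization techniques spread rapidly through Android malware research after they were proposed. To address the insufficient representational power of code images converted from a single DEX file, an Android malware family classification method based on code image integration was proposed. First, the DEX and XML files in the installation package, together with the JAR file generated by decompilation, were converted to grayscale images, and the Bilinear interpolation algorithm was used to rescale grayscale images of different sizes; the three grayscale images were then combined into a single three-channel RGB image for training and classification. For the classification model, a soft-threshold denoising module was combined with the Split-Attention-based ResNeSt to form STResNeSt. This model is strongly noise-resistant and attends more closely to the important features of code images. To handle the long-tailed data distribution during training, the Class-Balanced Loss (CB Loss) was introduced on top of data augmentation, offering a remedy for the overfitting caused by sample imbalance. On the Drebin dataset, the accuracy of the composite code image was 2.93 percentage points higher than that of the DEX grayscale image, the accuracy of STResNeSt was 1.1 percentage points higher than that of the Residual Network (ResNet), and data augmentation combined with CB Loss raised the F1 score by up to 2.4 percentage points. Experimental results show that the proposed method reaches an average classification accuracy of 98.97% and classifies Android malware families effectively.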
Mo LI, Tianliang LU, Ziheng XIE. Android malware family classification method based on code image integration[J]. Journal of Computer Applications, 2022, 42(5): 1490-1499.
File size | Image width/pixels |
---|---|
<10 KB | 64 |
[10 KB,30 KB) | 128 |
[30 KB,60 KB) | 214 |
[60 KB,100 KB) | 280 |
[100 KB,200 KB) | 400 |
[200 KB,500 KB) | 600 |
[500 KB,1 MB) | 864 |
[1 MB,2 MB) | 1 280 |
[2 MB,3 MB) | 1 600 |
[3 MB,4 MB] | 1 920 |
>4 MB | 2 048 |
Tab. 1 Ratio of code image conversion
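The lookup in Tab. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the helper names `image_width` and `bytes_to_gray` are hypothetical, 1 KB is taken as 1024 bytes, and the last image row is zero-padded, none of which the table specifies.

```python
import bisect

KB = 1024        # assumption: binary kilobytes
MB = 1024 * KB

# Upper bounds of the half-open size ranges in Tab. 1, in bytes.
_BOUNDS = [10 * KB, 30 * KB, 60 * KB, 100 * KB, 200 * KB,
           500 * KB, 1 * MB, 2 * MB, 3 * MB]
# Width for each range; the last entry covers [3 MB, 4 MB].
_WIDTHS = [64, 128, 214, 280, 400, 600, 864, 1280, 1600, 1920]
_MAX_WIDTH = 2048  # used for files larger than 4 MB


def image_width(n_bytes: int) -> int:
    """Return the image width in pixels that Tab. 1 assigns to a file size."""
    if n_bytes > 4 * MB:
        return _MAX_WIDTH
    # bisect_right places a size equal to a bound into the next range,
    # which matches the half-open intervals [low, high).
    return _WIDTHS[bisect.bisect_right(_BOUNDS, n_bytes)]


def bytes_to_gray(data: bytes) -> list[list[int]]:
    """Lay the bytes out row by row as a grayscale image (one byte per pixel)."""
    width = image_width(len(data))
    padded = data + bytes(-len(data) % width)  # zero-pad the last row (assumption)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]
```

Each byte becomes one pixel with a value in 0..255, so the image height follows from the file size and the chosen width.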
Transformation | Parameter value |
---|---|
Center rotation | [-0.1,0.1] |
Horizontal shift | [-0.1,0.1] |
Vertical shift | [-0.1,0.1] |
Center zoom | [-0.2,0.2] |
Flip mode | Horizontal flip |
Fill mode | Interpolation fill at fixed size 224 |
Tab. 2 Parameter values of data augmentation
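As a rough illustration of how the ranges in Tab. 2 might be used, the sketch below draws one random set of augmentation parameters. Several things here are assumptions: the table gives no units, the parameter names and the function `sample_augmentation` are hypothetical, sampling is taken to be uniform, and the flip is applied with probability 0.5.

```python
import random

# Ranges copied from Tab. 2; the table does not state their units.
AUGMENT_RANGES = {
    "rotation": (-0.1, 0.1),
    "shift_x": (-0.1, 0.1),
    "shift_y": (-0.1, 0.1),
    "zoom": (-0.2, 0.2),
}


def sample_augmentation(rng: random.Random) -> dict:
    """Draw one set of augmentation parameters, uniformly within each range."""
    params = {name: rng.uniform(low, high)
              for name, (low, high) in AUGMENT_RANGES.items()}
    params["flip_horizontal"] = rng.random() < 0.5  # assumed flip probability
    return params


if __name__ == "__main__":
    print(sample_augmentation(random.Random(0)))
```

Passing a seeded `random.Random` keeps the augmented training set reproducible between runs.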
Family | Total samples (original) | Total samples (augmented) | Training samples (original) | Training samples (augmented) | Test samples |
---|---|---|---|---|---|
FakeInstaller | 925 | 1 185 | 740 | 1 000 | 185 |
DroidKungFu | 666 | 1 133 | 533 | 1 000 | 133 |
Plankton | 625 | 1 125 | 500 | 1 000 | 125 |
Opfake | 613 | 1 123 | 490 | 1 000 | 123 |
GinMaster | 339 | 1 068 | 271 | 1 000 | 68 |
BaseBridge | 329 | 1 066 | 263 | 1 000 | 66 |
Iconosys | 152 | 1 030 | 122 | 1 000 | 30 |
Kmin | 147 | 1 029 | 118 | 1 000 | 29 |
FakeDoc | 132 | 1 027 | 105 | 1 000 | 27 |
Geinimi | 92 | 1 019 | 73 | 1 000 | 19 |
Tab. 3 Experimental dataset
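The imbalance in Tab. 3 is what the class-balanced loss is meant to counter. In the usual formulation, the loss weight of a class with n samples is proportional to (1 - β) / (1 - β^n), normalized so that the weights sum to the number of classes. The sketch below applies this to the original training counts. The value β = 0.999 and the choice of the original rather than augmented counts are assumptions, since neither is stated here.

```python
# Original training-set sizes per family, copied from Tab. 3.
TRAIN_COUNTS = {
    "FakeInstaller": 740, "DroidKungFu": 533, "Plankton": 500,
    "Opfake": 490, "GinMaster": 271, "BaseBridge": 263,
    "Iconosys": 122, "Kmin": 118, "FakeDoc": 105, "Geinimi": 73,
}


def class_balanced_weights(counts: dict, beta: float = 0.999) -> dict:
    """Weight each class by the inverse of its effective number of samples.

    The effective number is (1 - beta**n) / (1 - beta); weights are rescaled
    so that they sum to the number of classes.
    """
    raw = {name: (1.0 - beta) / (1.0 - beta ** n) for name, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {name: w * scale for name, w in raw.items()}


if __name__ == "__main__":
    for family, weight in class_balanced_weights(TRAIN_COUNTS).items():
        print(f"{family:14s} {weight:.3f}")
```

Rare families such as Geinimi receive larger weights than FakeInstaller, so their errors count for more in the loss.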
Interpolation algorithm | Size | F1/% | Accuracy/% | Time/s |
---|---|---|---|---|
Nearest | | 97.76 | 98.23 | 1.64 |
Box | | 98.10 | 98.54 | 7.53 |
Lanczos | | 98.21 | 98.63 | 18.47 |
Bicubic | | 98.80 | 98.94 | 10.03 |
Bilinear | | 98.81 | 98.97 | 5.03 |
Bicubic (EfficientNet) | | 95.95 | 96.48 | 10.03 |
Bilinear (EfficientNet) | | 94.68 | 95.35 | 5.03 |
Tab. 4 Performance comparison of different interpolation algorithms
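To make the Bilinear row concrete, the following NumPy sketch rescales three grayscale arrays with plain bilinear sampling and stacks them into one RGB array, as the abstract describes. It is a simplified illustration: library resizers may differ in pixel alignment and anti-aliasing, the target size of 224 is assumed, and the channel order (DEX, XML, JAR) is a guess that the page does not confirm. The function names are hypothetical.

```python
import numpy as np


def resize_bilinear(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Resize a 2-D uint8 array to size x size with plain bilinear sampling."""
    h, w = img.shape
    # Sampling positions in source coordinates, with corners aligned.
    ys = np.linspace(0, h - 1, size)
    xs = np.linspace(0, w - 1, size)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]  # vertical blend weights, shape (size, 1)
    wx = (xs - x0)[None, :]  # horizontal blend weights, shape (1, size)
    src = img.astype(np.float64)
    top = src[y0][:, x0] * (1 - wx) + src[y0][:, x1] * wx
    bottom = src[y1][:, x0] * (1 - wx) + src[y1][:, x1] * wx
    return np.rint(top * (1 - wy) + bottom * wy).astype(np.uint8)


def compose_rgb(dex: np.ndarray, xml: np.ndarray, jar: np.ndarray,
                size: int = 224) -> np.ndarray:
    """Stack three resized grayscale images into one (size, size, 3) array."""
    # Assumed channel order: DEX -> R, XML -> G, JAR -> B.
    return np.stack([resize_bilinear(c, size) for c in (dex, xml, jar)], axis=-1)
```

Because every input is resized to the same square, files of very different lengths end up as channels of one fixed-size image.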
Sample balancing method | Precision/% | Recall/% | F1/% |
---|---|---|---|
Original data | 97.82 | 98.48 | 98.15 |
Data augmentation | 98.43 | 97.91 | 98.17 |
CB Loss | 98.11 | 98.57 | 98.34 |
Data augmentation + CB Loss | 98.87 | 98.75 | 98.81 |
Tab. 5 Performance comparison of different sample balancing methods
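The F1 column in Tab. 5 is consistent with the harmonic mean of the precision and recall in the same row. The short check below recomputes it; the row labels are English renderings of the table entries and `f1_score` is a hypothetical helper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from Tab. 5.
TAB5 = {
    "Original data": (97.82, 98.48, 98.15),
    "Data augmentation": (98.43, 97.91, 98.17),
    "CB Loss": (98.11, 98.57, 98.34),
    "Data augmentation + CB Loss": (98.87, 98.75, 98.81),
}

if __name__ == "__main__":
    for method, (p, r, reported) in TAB5.items():
        print(f"{method}: computed {f1_score(p, r):.2f}, reported {reported:.2f}")
```

All four recomputed values agree with the reported ones to two decimal places.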
Base network | Layers | Precision/% | Recall/% | F1/% | Accuracy/% |
---|---|---|---|---|---|
ResNet | 50 | 97.38 | 96.81 | 97.09 | 97.87 |
ResNet | 101 | 96.88 | 97.14 | 97.01 | 97.80 |
ResNeXt | 50 | 97.65 | 97.71 | 97.68 | 97.98 |
ResNeXt | 101 | 97.64 | 97.40 | 97.52 | 98.19 |
SENet | 50 | 97.83 | 97.81 | 97.82 | 98.18 |
SENet | 101 | 97.96 | 97.94 | 97.95 | 98.27 |
SKNet | 50 | 98.50 | 97.76 | 98.13 | 98.39 |
SKNet | 101 | 97.54 | 98.22 | 97.88 | 98.31 |
ResNeSt | 50 | 98.14 | 98.66 | 98.40 | 98.61 |
ResNeSt | 101 | 98.41 | 98.33 | 98.37 | 98.57 |
STResNeSt | 50 | 98.87 | 98.75 | 98.81 | 98.97 |
STResNeSt | 101 | 98.81 | 98.69 | 98.75 | 98.95 |
Tab. 6 Performance comparison of different residual networks
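STResNeSt differs from ResNeSt in Tab. 6 by its soft-threshold denoising module. The core operation, soft thresholding, is small enough to show directly. The sketch below takes the threshold `tau` as a plain argument; how the network derives its threshold is not described on this page, so that part is left out, and the function names are hypothetical.

```python
import math


def soft_threshold(x: float, tau: float) -> float:
    """Shrink x toward zero by tau; values with |x| <= tau become zero."""
    return math.copysign(max(abs(x) - tau, 0.0), x)


def denoise(features: list[float], tau: float) -> list[float]:
    """Apply soft thresholding to every value in a feature vector."""
    return [soft_threshold(v, tau) for v in features]


if __name__ == "__main__":
    print(denoise([3.0, -3.0, 0.5, -0.2], tau=1.0))
```

Small activations, which are more likely to be noise, are zeroed out, while larger ones keep their sign and lose only `tau` in magnitude.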
Code image generation method | Precision/% | Recall/% | F1/% | Accuracy/% |
---|---|---|---|---|
JAR | 94.59 | 94.43 | 94.51 | 94.97 |
JAR (character filtering) | 95.73 | 95.29 | 95.51 | 96.01 |
XML | 95.96 | 96.38 | 96.17 | 96.92 |
DEX (grayscale image) | 95.53 | 95.83 | 95.68 | 96.04 |
DEX | 96.79 | 96.69 | 96.74 | 97.09 |
Composite image | 98.87 | 98.75 | 98.81 | 98.97 |
Tab. 7 Performance comparison of different code image generation methods