Malicious code detection method based on attention mechanism and residual network

Journal of Computer Applications, 2022, Vol. 42, Issue (6): 1708-1715. DOI: 10.11772/j.issn.1001-9081.2021061410

Special topic: Papers of the 2021 National Annual Conference on Open Distributed and Parallel Computing (DPCS 2021)

Yang ZHANG, Jiangbo HAO

School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang, Hebei 050018, China

Received: 2021-08-06 Revised: 2021-09-10 Accepted: 2021-10-20 Online: 2022-01-10 Published: 2022-06-10
Contact: Yang ZHANG
About author: HAO Jiangbo, born in 1996 in Xingtai, Hebei, M.S. candidate. His research interests include intelligent software analysis.
Supported by:
National Natural Science Foundation of China (61440012); Key Basic Research Project of Hebei Fundamental Research Plan (18960106D); Key Project of Higher Education Research Program of Hebei Province (ZD2019093)

Abstract

Existing malicious code detection methods based on deep learning extract insufficient features and achieve low accuracy. To address these problems, a malicious code detection method based on the attention mechanism and Residual Network (ResNet), called ARMD, was proposed. To support the training of this method, the hash values of 47 580 malicious and benign code samples were obtained from the Kaggle website, and the APIs called by each sample were extracted with the VirusTotal analysis tool. The called APIs were then consolidated into 1 000 distinct APIs, which served as the detection features for constructing the training samples. Next, each sample was labeled as benign or malicious according to the VirusTotal analysis results, and the Synthetic Minority Over-sampling Technique (SMOTE) was used to balance the data samples. Finally, a ResNet with an injected attention mechanism was built and trained to perform malicious code detection. Experimental results show that ARMD achieves a malicious code detection accuracy of 97.76%, and that its average precision is at least 2 percentage points higher than that of the existing detection methods based on Convolutional Neural Network (CNN) and ResNet models, which verifies the effectiveness of ARMD.

Key words: deep learning, malicious code, attention mechanism, Residual Network (ResNet), Synthetic Minority Over-sampling Technique (SMOTE)
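The abstract names SMOTE as the step that balances the benign and malicious classes. The following is a minimal sketch of the core SMOTE idea (a synthetic sample is placed on the line segment between a minority-class sample and one of its k nearest minority-class neighbours), written with the Python standard library only. The function name `smote`, its parameters and the toy data are illustrative assumptions, not the authors' code; a maintained library implementation would normally be preferred in practice.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples from minority-class feature vectors.

    Sketch only: it needs at least two minority samples and re-sorts the
    neighbours for every synthetic sample, which is fine for toy data.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (the base itself is excluded)
        others = [s for s in minority if s is not base]
        neighbours = sorted(others, key=lambda s: math.dist(base, s))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority class with two binary API features per sample.
minority = [[0, 0], [1, 0], [0, 1], [1, 1]]
new_samples = smote(minority, n_new=6, k=2)
```

Note that interpolating between 0/1 API indicators yields fractional feature values; whether and how the original work handles this is not stated on this page.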

CLC number: TP311.53

Yang ZHANG, Jiangbo HAO. Malicious code detection method based on attention mechanism and residual network[J]. Journal of Computer Applications, 2022, 42(6): 1708-1715.

Figures/Tables: 11

Fig. 1 Overall structure of the proposed method

Fig. 2 Proposed attention mechanism for detecting malicious code

Fig. 3 Comparison between the improved residual block and the original one

Fig. 4 ARMD based on ResNet18 and attention mechanism

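Tab. 1 Some API functions and their functional descriptions

API function	Function
CreateRemoteThread	Creates a thread that runs in the address space of another process (a remote thread)
GetModuleHandle	Retrieves a handle to the specified module, which must already be loaded by the calling process
EnumResourceNamesA	Enumerates the specified binary resources
GetInforAndOpenUrl	Gets system information, checks whether antivirus software is present, connects to the specified URL, and writes the dropped file into the registry so that the virus starts automatically
DeviceIoControl	Passes information between user space and kernel space
GetProcAddress	Retrieves the address of an exported function or variable from the specified dynamic-link library (DLL)
CreateStreamOnHGlobal	Creates a stream object that uses an HGLOBAL memory handle to store the stream contents
wsprintf	Writes formatted data to the specified buffer; any arguments are written to the output buffer according to the corresponding format specification in the format string
LocalFree	Frees the specified local memory object and invalidates its handle
ExitProcess	Ends the calling process and all of its threads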

Tab. 2 Hash values of some malicious codes and their extracted APIs

Hash value	GetProcAddress	ExitProcess	CloseHandle	OpenProcess	Malware
071e8c3f8922e186e57548cd4c703a5d	1	1	1	1	1
33f8e6d08a6aae939f25a8e0d63dd523	1	1	1	1	1
72049be7bd30ea61297ea624ae198067	1	1	0	1	1
2a1e576d411c5d5370e381042f973ea5	1	1	1	0	0
ca66c2f1ddaca8a4e682917a9b833e86	0	0	1	0	0
4e49b660879ece49c302e0c25cc5fc83	1	0	0	1	1
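Tab. 2 shows each sample as a row of 0/1 indicators, one per API, followed by the label. The sketch below illustrates how such rows could be produced from per-sample API lists. The helper names `build_vocabulary` and `encode_sample`, the toy samples, and the choice of ranking APIs by frequency are assumptions made for illustration; this page does not state how the 1 000 distinct APIs were selected.

```python
from collections import Counter

def build_vocabulary(samples, size):
    """Pick the `size` most frequently called APIs across all samples.

    Assumption: frequency ranking is only one plausible selection rule;
    the rule actually used for the 1 000 APIs is not given in the source.
    """
    counts = Counter(api for apis in samples for api in set(apis))
    return [api for api, _ in counts.most_common(size)]

def encode_sample(called_apis, vocabulary):
    """Return a 0/1 list with one position per vocabulary API."""
    called = set(called_apis)
    return [1 if api in called else 0 for api in vocabulary]

# Hypothetical toy samples; the API names are the column names of Tab. 2.
samples = [
    ["GetProcAddress", "ExitProcess", "OpenProcess"],
    ["GetProcAddress", "ExitProcess", "CloseHandle"],
    ["CloseHandle"],
]
vocabulary = build_vocabulary(samples, size=4)

# Encode the first sample in the column order of Tab. 2.
columns = ["GetProcAddress", "ExitProcess", "CloseHandle", "OpenProcess"]
row = encode_sample(samples[0], columns)
```

With the Tab. 2 column order, the first toy sample encodes to the same feature pattern as the third row of the table (1, 1, 0, 1).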

Tab. 3 Confusion matrix of binary classification problem

Actual value	Predicted value		Total
	1	0	
1	TP	FN	TP+FN
0	FP	TN	FP+TN
Total	TP+FP	FN+TN	TP+TN+FP+FN
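The four metrics reported in the result tables (precision, recall, F1 and accuracy) follow from the confusion-matrix cells in Tab. 3 by their standard definitions. The sketch below computes them for the positive class; the function name `binary_metrics` and the example counts are hypothetical, and whether the tables report per-class or averaged values is not stated on this page.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard metrics for the positive class from confusion-matrix counts."""
    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, not taken from the paper.
metrics = binary_metrics(tp=90, fn=10, fp=5, tn=95)
```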

Tab. 4 Test results for question 1 (unit: %)

Model	Precision	Recall	F1	Accuracy
CNN	90.2	91.6	90.1	91.6
LSTM	82.6	82.1	82.4	82.4
ResNet18	95.0	95.0	95.0	95.0

Tab. 5 Test results for question 2 (unit: %)

Model	Precision	Recall	F1	Accuracy
KNN+SVM	94.5	95.5	94.5	95.5
ANN	95.6	95.1	95.4	95.2
ARMD	97.7	97.6	97.6	97.6

Tab. 6 Evaluation results for question 3 (unit: %)

Model	Precision	Recall	F1	Accuracy
ResNet18	95.0	95.0	95.0	95.0
ARMD	97.7	97.6	97.6	97.6

Tab. 7 Evaluation results for question 4 (unit: %)

Model	Precision	Recall	F1	Accuracy
ARMD	97.7	97.6	97.6	97.6
ResNet34	95.8	95.8	95.8	95.8
ResNet34+SENet	96.6	96.6	96.6	96.6
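Tab. 7 includes a ResNet34 + SENet baseline, that is, a residual network whose blocks are gated by squeeze-and-excitation channel attention. As a rough illustration of how attention can be injected into a residual block, the sketch below computes x + gate * F(x) with one sigmoid gate per channel. It shows the generic SENet-style gate only, not the attention mechanism of Fig. 2, whose details are not given on this page. The names `dense` and `se_residual_block`, the `branch` stand-in for the convolutional layers, and all weights are hypothetical.

```python
import math

def dense(x, weights, bias):
    """Fully connected layer on plain lists: weights @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def se_residual_block(x, branch, w1, b1, w2, b2):
    """Compute x + gate * branch(x) with one sigmoid gate per channel.

    x      : feature map as a list of channels, each a list of floats
    branch : stand-in for the block's convolutional layers (same shape in and out)
    w1, b1 : reduction layer (channels -> hidden units)
    w2, b2 : expansion layer (hidden units -> channels)
    """
    f = branch(x)
    squeezed = [sum(ch) / len(ch) for ch in f]               # squeeze: global average per channel
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]  # ReLU
    gates = [1.0 / (1.0 + math.exp(-g)) for g in dense(hidden, w2, b2)]  # sigmoid
    out = [[xv + g * fv for xv, fv in zip(x_ch, f_ch)]
           for x_ch, f_ch, g in zip(x, f, gates)]
    return out, gates

def double(feature_map):
    """Toy residual branch that doubles every value."""
    return [[2.0 * v for v in ch] for ch in feature_map]

# Toy input: 2 channels, 3 positions; all-zero weights give gates of 0.5.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
out, gates = se_residual_block(x, double,
                               w1=[[0.0, 0.0]], b1=[0.0],
                               w2=[[0.0], [0.0]], b2=[0.0, 0.0])
```

In a trained network the two small dense layers learn which channels to emphasise, so the gates differ per channel and per input instead of staying at 0.5.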
