Malicious code classification algorithm based on multi-feature fusion

Journal of Computer Applications, 2019, Vol. 39, Issue 8: 2333-2338. DOI: 10.11772/j.issn.1001-9081.2019010116

LANG Dapeng¹,², DING Wei¹, JIANG Haocheng¹, CHEN Zhiyuang¹

1. College of Computer Science and Technology, Harbin Engineering University, Harbin, Heilongjiang 150001, China;
2. Key Laboratory of Network Assessment Technology, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China

Received: 2019-01-16 Revised: 2019-04-17 Online: 2019-08-10 Published: 2019-04-24
Corresponding author: DING Wei
About the authors: LANG Dapeng (郎大鹏, born 1983), male, from Harbin, Heilongjiang, lecturer, Ph.D., research interest: information security; DING Wei (丁巍, born 1995), male, from Shaoyang, Hunan, M.S. candidate, research interest: information security; JIANG Haocheng (姜昊辰, born 1994), male, from Harbin, Heilongjiang, M.S. candidate, research interest: information security; CHEN Zhiyuang (陈志远, born 1978), male, from Harbin, Heilongjiang, lecturer, Ph.D., research interests: digital modeling, water conservancy and hydropower engineering.
Supported by:
This work is partially supported by the Open Project of Key Laboratory of Network Assessment Technology, Institute of Information Engineering, Chinese Academy of Sciences (10201050201).

Abstract

Abstract: Most research on malicious code classification addresses either family classification or the distinction between malicious and benign code, while classification by category has received comparatively little attention. To address this gap, a malicious code classification algorithm based on multi-feature fusion is proposed. Three sets of features are extracted from texture images and disassembly files and fused for classification. First, gray-level co-occurrence matrix features are extracted from the source files and the disassembly files, and opcode sequences are extracted with the n-gram algorithm. Second, an improved Information Gain (IG) algorithm is used to select the opcode features. Third, the feature groups are standardized and used to train a Random Forest (RF) classifier. Finally, a random forest classifier based on multi-feature fusion is implemented. In training and testing on nine categories of malicious code, the proposed algorithm achieves an accuracy of 85%, which is higher than that of a random forest using a single feature and of multi-layer perceptron and logistic regression classifiers using multiple features.
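As a rough illustration of the texture feature mentioned in the abstract, the sketch below lays a file's bytes out as a grayscale image and computes a few statistics from a gray-level co-occurrence matrix (GLCM). The abstract does not state which statistics, image width, quantization, or pixel offset the authors used, so the function names `bytes_to_gray_rows` and `glcm_features`, the width of 16, the 16 gray levels, the (1, 0) offset, and the choice of contrast, angular second moment (ASM), and homogeneity are all assumptions made for this example.

```python
from collections import Counter


def bytes_to_gray_rows(data, width=16):
    """Lay raw bytes out row by row as a grayscale image (one byte = one pixel).

    Trailing bytes that do not fill a complete row are dropped.
    """
    usable = len(data) - len(data) % width
    return [list(data[i:i + width]) for i in range(0, usable, width)]


def glcm_features(rows, levels=16, dx=1, dy=0):
    """Return contrast, ASM and homogeneity of the normalized GLCM.

    Pixel values 0..255 are quantized to `levels` gray levels, and pairs are
    counted between each pixel and its neighbour at offset (dx, dy).
    """
    height = len(rows)
    width = len(rows[0]) if rows else 0
    counts = Counter()
    for y in range(height):
        for x in range(width):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                a = rows[y][x] * levels // 256    # quantized level of the pixel
                b = rows[ny][nx] * levels // 256  # quantized level of its neighbour
                counts[(a, b)] += 1

    total = sum(counts.values())
    features = {"contrast": 0.0, "asm": 0.0, "homogeneity": 0.0}
    for (i, j), count in counts.items():
        p = count / total                         # joint probability of (i, j)
        features["contrast"] += p * (i - j) ** 2
        features["asm"] += p * p
        features["homogeneity"] += p / (1 + (i - j) ** 2)
    return features


if __name__ == "__main__":
    # Stand-in for the bytes of an executable; a real run would read a file.
    sample = bytes(range(256)) * 4
    print(glcm_features(bytes_to_gray_rows(sample)))
```

In a fusion setting, the resulting dictionary values would form one group of the feature vector. Several offsets (for example horizontal, vertical, and diagonal) are commonly computed and concatenated, which this sketch leaves to the caller.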

Key words: malicious code, texture feature, opcode sequence, Random Forest (RF), static analysis
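The opcode-sequence feature described in the abstract can be sketched in a similar way: opcode n-grams are collected from each disassembled sample and ranked by information gain with respect to the class label. This example uses the textbook information gain of a binary "n-gram present in sample" feature, not the improved IG variant of the paper, whose details the abstract does not give. The function names, the bigram size, and the toy opcode lists and labels are hypothetical.

```python
import math
from collections import Counter


def opcode_ngrams(opcodes, n=2):
    """Return all contiguous n-grams of an opcode list as tuples."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(samples, labels, gram):
    """Information gain of the binary feature 'gram occurs in the sample'.

    `samples` is a list of sets of n-grams, one set per sample.
    """
    present = [lab for s, lab in zip(samples, labels) if gram in s]
    absent = [lab for s, lab in zip(samples, labels) if gram not in s]
    total = len(labels)
    conditional = (len(present) / total) * entropy(present) \
        + (len(absent) / total) * entropy(absent)
    return entropy(labels) - conditional


def top_ngrams(opcode_lists, labels, n=2, k=3):
    """Return the k n-grams with the highest information gain.

    Ties are broken by the lexicographic order of the n-gram.
    """
    samples = [set(opcode_ngrams(ops, n)) for ops in opcode_lists]
    vocabulary = set().union(*samples)
    ranked = sorted(
        vocabulary,
        key=lambda g: (-information_gain(samples, labels, g), g),
    )
    return ranked[:k]


if __name__ == "__main__":
    # Toy data: two categories with two samples each (hypothetical).
    opcode_lists = [
        ["push", "mov", "call"],
        ["push", "mov", "call"],
        ["xor", "jmp", "xor"],
        ["xor", "jmp", "ret"],
    ]
    labels = ["trojan", "trojan", "worm", "worm"]
    print(top_ngrams(opcode_lists, labels))
```

The selected n-grams would then be turned into per-sample counts or presence flags and, after standardization, concatenated with the texture features before training the classifier.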

CLC number: TP309

LANG Dapeng, DING Wei, JIANG Haocheng, CHEN Zhiyuang. Malicious code classification algorithm based on multi-feature fusion[J]. Journal of Computer Applications, 2019, 39(8): 2333-2338.

