Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (8): 2333-2338. DOI: 10.11772/j.issn.1001-9081.2019010116

• Cyber security •

基于多特征融合的恶意代码分类算法

郎大鹏1,2, 丁巍1, 姜昊辰1, 陈志远1   

  1. 哈尔滨工程大学 计算机科学与技术学院, 哈尔滨 150001;
    2. 中国科学院信息工程研究所 中国科学院网络测评技术重点实验室, 北京 100093
  • 收稿日期:2019-01-16 修回日期:2019-04-17 出版日期:2019-08-10 发布日期:2019-04-24
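  • Corresponding author: DING Wei
  • About the authors: LANG Dapeng (born 1983), male, from Harbin, Heilongjiang, lecturer, Ph.D.; main research interest: information security. DING Wei (born 1995), male, from Shaoyang, Hunan, M.S. candidate; main research interest: information security. JIANG Haochen (born 1994), male, from Harbin, Heilongjiang, M.S. candidate; main research interest: information security. CHEN Zhiyuan (born 1978), male, from Harbin, Heilongjiang, lecturer, Ph.D.; main research interests: digital modeling, water conservancy and hydropower engineering.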
  • 基金资助:
    中国科学院信息工程研究所中国科学院网络测评技术重点实验室开放课题资助项目(10201050201)。

Malicious code classification algorithm based on multi-feature fusion

LANG Dapeng1,2, DING Wei1, JIANG Haochen1, CHEN Zhiyuan1

  1. College of Computer Science and Technology, Harbin Engineering University, Harbin, Heilongjiang 150001, China;
    2. Key Laboratory of Network Assessment Technology, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
  • Received: 2019-01-16 Revised: 2019-04-17 Online: 2019-08-10 Published: 2019-04-24
  • Supported by:
    This work is partially supported by the Open Project of Key Laboratory of Network Assessment Technology, Institute of Information Engineering, Chinese Academy of Sciences (10201050201).

摘要: 针对多数恶意代码分类研究都基于家族分类和恶意、良性代码分类,而种类分类比较少的问题,提出了多特征融合的恶意代码分类算法。采用纹理图和反汇编文件提取3组特征进行融合分类研究,首先使用源文件和反汇编文件提取灰度共生矩阵特征,由n-gram算法提取操作码序列;然后采用改进型信息增益(IG)算法提取操作码特征,其次将多组特征进行标准化处理后以随机森林(RF)为分类器进行学习;最后实现了基于多特征融合的随机森林分类器。通过对九类恶意代码进行学习和测试,所提算法取得了85%的准确度,相比单一特征下的随机森林、多特征下的多层感知器和Logistic回归算法分类器,准确率更高。

关键词: 恶意代码, 纹理特征, 操作码序列, 随机森林, 静态分析

Abstract: Most research on malicious code classification addresses either family classification or the distinction between malicious and benign code, while classification by malware category has received comparatively little attention. To address this gap, a malicious code classification algorithm based on multi-feature fusion was proposed. Three groups of features extracted from texture images and disassembly files were fused for classification. Firstly, gray-level co-occurrence matrix features were extracted from the source files and the disassembly files, and opcode sequences were extracted with the n-gram algorithm. Secondly, an improved Information Gain (IG) algorithm was used to select opcode features. Thirdly, the feature groups were normalized and a Random Forest (RF) classifier was trained on them. Finally, a random forest classifier based on multi-feature fusion was implemented. In training and testing on nine categories of malicious code, the proposed algorithm achieved an accuracy of 85%, which is higher than that of a random forest using a single feature group and that of multi-layer perceptron and Logistic regression classifiers using the fused features.
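
The texture step named in the abstract can be illustrated with a minimal Python sketch of a gray-level co-occurrence matrix. The abstract does not state how bytes are mapped to pixels, which pixel offsets are used, or which statistics are taken from the matrix, so the helper `bytes_to_image`, the row width, the number of gray levels, the single horizontal offset, and the three statistics below are all assumptions chosen for illustration, not the authors' settings.

```python
def bytes_to_image(data, width, levels):
    """Reshape raw bytes into rows of quantized gray levels.

    Hypothetical layout: one byte per pixel, fixed row width, and any
    trailing bytes that do not fill a row are dropped.
    """
    quantized = [b * levels // 256 for b in data]
    usable = len(quantized) - len(quantized) % width
    return [quantized[i:i + width] for i in range(0, usable, width)]


def glcm(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for one pixel offset."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("image too small for the requested offset")
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """Contrast, angular second moment and homogeneity of a normalized GLCM."""
    n = len(p)
    cells = [(i, j) for i in range(n) for j in range(n)]
    return {
        "contrast": sum(p[i][j] * (i - j) ** 2 for i, j in cells),
        "asm": sum(p[i][j] ** 2 for i, j in cells),
        "homogeneity": sum(p[i][j] / (1 + abs(i - j)) for i, j in cells),
    }


# Made-up 16-byte input standing in for a binary file.
data = bytes([0, 0, 255, 255, 0, 0, 255, 255,
              0, 128, 255, 255, 128, 128, 255, 255])
image = bytes_to_image(data, width=4, levels=4)
p = glcm(image, levels=4)
features = texture_features(p)
```

In a fused pipeline, a dictionary like `features` would supply one group of columns per sample, alongside the opcode-based groups.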

Key words: malicious code, texture feature, opcode sequence, Random Forest (RF), static analysis
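
The opcode step described in the abstract (n-gram extraction followed by information-gain-based selection) can be sketched as follows. This is a toy illustration using the standard information gain of a binary "n-gram present" feature; the paper's improved IG variant is not specified in the abstract and is not reproduced here. The opcode sequences, category labels, and function names are invented for the example.

```python
import math
from collections import Counter


def opcode_ngrams(opcodes, n=2):
    """Return the list of n-grams (as tuples) over an opcode sequence."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(samples, labels, gram):
    """Standard IG of the binary feature 'gram occurs in the sample'."""
    present = [y for grams, y in zip(samples, labels) if gram in grams]
    absent = [y for grams, y in zip(samples, labels) if gram not in grams]
    conditional = 0.0
    for part in (present, absent):
        if part:
            conditional += len(part) / len(labels) * entropy(part)
    return entropy(labels) - conditional


# Made-up opcode sequences with hypothetical category labels.
sequences = [
    ["push", "mov", "call", "ret"],
    ["push", "mov", "call", "pop"],
    ["xor", "jmp", "xor", "jmp"],
    ["xor", "jmp", "add", "ret"],
]
labels = ["trojan", "trojan", "worm", "worm"]

# Each sample becomes the set of 2-grams it contains.
samples = [set(opcode_ngrams(seq, n=2)) for seq in sequences]
vocabulary = sorted(set().union(*samples))

# Rank n-grams by information gain; the top-k would be kept as features.
ranked = sorted(vocabulary,
                key=lambda g: information_gain(samples, labels, g),
                reverse=True)
best_gain = information_gain(samples, labels, ("xor", "jmp"))
weak_gain = information_gain(samples, labels, ("call", "ret"))
```

Here `("xor", "jmp")` separates the two toy categories perfectly and so receives the maximum gain of 1 bit, while `("call", "ret")`, which appears in only one sample, scores much lower. The selected n-gram columns would then be normalized and combined with the texture features before training the classifier.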
