Journal of Computer Applications (计算机应用) ›› 2011, Vol. 31 ›› Issue (07): 1894-1897. DOI: 10.3724/SP.J.1087.2011.01894

• Information Security •

基于变长元组的文件类型识别算法

曹鼎,罗军勇,尹美娟   

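  1. Institute of Information Engineering, Information Engineering University, Zhengzhou 450002
  • Received: 2011-01-21 Revised: 2011-03-02 Online: 2011-07-01 Published: 2011-07-01
  • Corresponding author: CAO Ding
  • About the authors: CAO Ding (born 1984), female, from Zhengzhou, Henan, master's candidate; research interests: pattern recognition, data mining. LUO Jun-yong (born 1964), male, from Nanchang, Jiangxi, professor; research interests: information security, data mining. YIN Mei-juan (born 1977), female, from Wuhu, Anhui, lecturer; research interests: data mining, social network analysis.
  • Supported by:

    National defense project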

Variable length gram based file type identification algorithm

Ding CAO, Jun-yong LUO, Mei-juan YIN

  1. Institute of Information Engineering, Information Engineering University, Zhengzhou, Henan 450002, China
  • Received:2011-01-21 Revised:2011-03-02 Online:2011-07-01 Published:2011-07-01
  • Contact: Ding CAO

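Abstract: Quickly and accurately determining the true type of a file is important for protecting computer information security. After analyzing the problems of existing file type identification algorithms based on binary content, this paper proposes describing the statistical features of files with variable-length grams, and designs a feature evaluation function that combines the divergence, stability, and conditional width of grams in structured files, so that effective features are selected more accurately. The algorithm does not depend on the structure or key identifiers of any specific file type, so it applies more widely. Experiments show that it effectively improves the precision and recall of file type identification.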

Keywords: file type identification, variable-length gram, gram frequency distribution, file type fingerprint, feature selection

Abstract: Fast and accurate identification of the true type of an arbitrary file is important for information security. To address the problems of current content-based file type identification algorithms, variable-length grams are introduced to describe the statistical characteristics of a file's binary content, and a new evaluation function combining gram divergence, stability, and conditional width is adopted for feature selection on structured file types. The algorithm does not rely on the structure or key words of any specific file type, so it can be applied more widely. Experimental results show that the proposed approach improves the precision and recall of file type identification.

Key words: file type identification, variable-length gram, gram frequency distribution, fileprints, feature selection
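
The gram frequency distribution named in the abstract can be illustrated with a minimal Python sketch. It only counts byte grams of several lengths in a file's binary content. The abstract does not define the evaluation function (divergence, stability, conditional width), so it is not implemented here. The function names `gram_frequencies` and `relative_frequencies`, the length bounds, and the sample bytes are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def gram_frequencies(data: bytes, min_len: int = 1, max_len: int = 3) -> dict:
    """Count every byte gram of each length from min_len to max_len.

    Returns a dict mapping gram length -> Counter of grams (hypothetical layout).
    """
    counts = {}
    for n in range(min_len, max_len + 1):
        # Slide a window of width n over the content; an input shorter
        # than n simply yields an empty Counter.
        counts[n] = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return counts


def relative_frequencies(counter: Counter) -> dict:
    """Normalize raw counts of one gram length into a frequency distribution."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counter.items()}


# Illustrative input only: a few bytes resembling the start of a PDF file.
sample = b"%PDF-1.4\n%PDF"
freqs = gram_frequencies(sample, min_len=1, max_len=4)
dist4 = relative_frequencies(freqs[4])
print(freqs[4][b"%PDF"], dist4[b"%PDF"])
```

A feature selection step in the spirit of the abstract would then score each gram across many training files of each type and keep the highest-scoring grams as the type's fingerprint. The scoring itself is left out because the abstract does not specify it.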
