Journal of Computer Applications (计算机应用), 2016, Vol. 36, Issue (11): 3161-3164. DOI: 10.11772/j.issn.1001-9081.2016.11.3161

• Artificial Intelligence •

File type detection algorithm based on principal component analysis and K nearest neighbors
(Original title: 基于主成分分析和K近邻的文件类型识别算法)

YAN Mengdi (鄢梦迪), QIN Linlin (秦琳琳), WU Gang (吴刚)

  1. Institute of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui 230022, China
  • Received: 2016-04-29 Revised: 2016-06-30 Online: 2016-11-10 Published: 2016-11-12
  • Corresponding author: QIN Linlin
  • About the authors: YAN Mengdi (born 1993), female, from Bengbu, Anhui, is a master's student whose main research interest is network communication and control. QIN Linlin (born 1975), female, from Zongyang, Anhui, is a senior engineer with a Ph.D. whose main research interests are artificial environment modeling and control, and hybrid systems. WU Gang (born 1964), male, from Nantong, Jiangsu, is a professor with a Ph.D. whose main research interests are advanced control and optimization, and new energy vehicles.
  • Supported by:
    This work is partially supported by the Fundamental Research Funds for the Central Universities (WK2100100024).

Abstract: Identifying a file's type from its extension or from its file feature identifier (signature) leads to a high misjudgment rate. To address this problem, a content-based file type detection algorithm combining Principal Component Analysis (PCA) and K Nearest Neighbors (KNN) was proposed. First, PCA was used to preprocess the samples and reduce the dimension of the sample space. Then, the dimension-reduced training samples were clustered, so that each file type was represented by its cluster centroids. Finally, to reduce the classification error that unevenly distributed training samples may cause, a KNN algorithm based on distance weighting was proposed. The experimental results show that, when the number of samples is large, the improved algorithm reduces the computational complexity of classification while maintaining a high recognition accuracy rate. Moreover, because the algorithm does not depend on the feature identifier of each file type, it can be applied more widely.

Key words: file type identification, byte frequency distribution, Principal Component Analysis (PCA), K Nearest Neighbors (KNN)
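
The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the clustering method or the weighting formula, so this sketch assumes k-means clustering and inverse-distance vote weights. All function names are hypothetical, and the demo data is synthetic (random printable-ASCII bytes versus uniformly random bytes). Only NumPy is used.

```python
import numpy as np

def byte_frequency(data: bytes) -> np.ndarray:
    """256-bin byte frequency distribution, normalized to sum to 1."""
    counts = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
    return counts / max(len(data), 1)

def pca_fit(X, n_components):
    """Return the sample mean and the top principal directions (via SVD)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(X, mean, components):
    """Project samples onto the principal directions."""
    return (X - mean) @ components.T

def kmeans(X, k, rng, iters=20):
    """Plain Lloyd iterations (assumed clustering step); returns k centroids."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):  # keep the old centroid if a cluster empties
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids

def fit_centroids(Z, labels, k_per_class, rng):
    """Represent every file type by the centroids of its own samples."""
    labels = np.array(labels)
    cents, cent_labels = [], []
    for lab in sorted(set(labels.tolist())):
        cents.append(kmeans(Z[labels == lab], k_per_class, rng))
        cent_labels += [lab] * k_per_class
    return np.vstack(cents), cent_labels

def weighted_knn(q, centroids, cent_labels, k=3, eps=1e-12):
    """Vote among the k nearest centroids with assumed weights 1 / distance."""
    dist = np.linalg.norm(centroids - q, axis=1)
    votes = {}
    for i in np.argsort(dist)[:k]:
        votes[cent_labels[i]] = votes.get(cent_labels[i], 0.0) + 1.0 / (dist[i] + eps)
    return max(votes, key=votes.get)

def demo():
    """Train on synthetic 'text' and 'binary' samples, then classify two new ones."""
    rng = np.random.default_rng(0)
    make = {
        "text": lambda: rng.integers(32, 127, size=2000, dtype=np.uint8).tobytes(),
        "binary": lambda: rng.integers(0, 256, size=2000, dtype=np.uint8).tobytes(),
    }
    labels = ["text"] * 20 + ["binary"] * 20
    X = np.array([byte_frequency(make[lab]()) for lab in labels])
    mean, comps = pca_fit(X, n_components=2)
    cents, cent_labels = fit_centroids(pca_transform(X, mean, comps), labels, 2, rng)
    queries = np.array([byte_frequency(make["text"]()), byte_frequency(make["binary"]())])
    Zq = pca_transform(queries, mean, comps)
    return [weighted_knn(z, cents, cent_labels) for z in Zq]

if __name__ == "__main__":
    print(demo())
```

Classifying against a few centroids per type rather than against every training sample is what keeps the per-query cost low as the training set grows. The inverse-distance weights let a close centroid outweigh several distant ones.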

CLC number: