File type detection algorithm based on principal component analysis and K nearest neighbors

Journal of Computer Applications, 2016, Vol. 36, Issue (11): 3161-3164. DOI: 10.11772/j.issn.1001-9081.2016.11.3161

YAN Mengdi, QIN Linlin, WU Gang

School of Information Science and Technology, University of Science and Technology of China, Hefei Anhui 230022, China

Received: 2016-04-29  Revised: 2016-06-30  Online: 2016-11-12  Published: 2016-11-10
Corresponding author: QIN Linlin
About the authors: YAN Mengdi (born 1993), female, from Bengbu, Anhui, is a master's candidate; her main research interest is network communication and control. QIN Linlin (born 1975), female, from Zongyang, Anhui, Ph.D., is a senior engineer; her main research interests are artificial environment modeling and control, and hybrid systems. WU Gang (born 1964), male, from Nantong, Jiangsu, Ph.D., is a professor; his main research interests are advanced control and optimization, and new energy vehicles.
Supported by:
This work is partially supported by the Fundamental Research Funds for the Central Universities (WK2100100024).

Abstract

Abstract: Identifying file types by file extension or by file signature (feature identifier) has a high misidentification rate. To address this problem, a content-based file type identification algorithm that combines Principal Component Analysis (PCA) and K Nearest Neighbors (KNN) is proposed. First, PCA is applied to the samples to reduce the dimensionality of the sample space. Second, the dimension-reduced training samples are clustered, so that each file type is represented by its cluster centroids. Finally, to reduce the classification error that unevenly distributed training samples may cause, a distance-weighted KNN algorithm is proposed. Experimental results show that, with a large number of training samples, the improved algorithm reduces the computational complexity of classification while maintaining a high recognition accuracy. Because the algorithm does not depend on the signature of any particular file type, it can be applied more widely.

Key words: file type identification, byte frequency distribution, Principal Component Analysis (PCA), K Nearest Neighbors (KNN)
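
The pipeline summarized in the abstract (byte frequency features, PCA, per-type cluster centroids, distance-weighted KNN) can be illustrated with a minimal sketch. This is not the authors' implementation. Several details the abstract leaves open are assumptions here: PCA is computed by power iteration with deflation, centroids come from a plain k-means run with two clusters per type, the vote weight is the inverse distance `1 / (d + eps)`, and the training data is synthetic. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def byte_frequency(data):
    """256-bin byte frequency distribution, normalized to sum to 1."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(b, 0) / total for b in range(256)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pca_fit(samples, n_components, iters=50, seed=0):
    """Return (mean, components) using power iteration with deflation."""
    rng = random.Random(seed)
    mean = [sum(col) / len(samples) for col in zip(*samples)]
    rows = [[x - m for x, m in zip(s, mean)] for s in samples]
    components = []
    for _ in range(n_components):
        cols = list(zip(*rows))
        v = [rng.gauss(0, 1) for _ in mean]
        for _ in range(iters):
            proj = [dot(r, v) for r in rows]
            # w = X^T (X v): the unscaled covariance matrix applied to v
            w = [dot(proj, col) for col in cols]
            norm = math.sqrt(dot(w, w))
            if norm == 0:  # no variance left; a zero component is harmless
                v = [0.0] * len(mean)
                break
            v = [x / norm for x in w]
        components.append(v)
        # Deflation: remove the direction just found from every row
        rows = [[x - dot(r, v) * c for x, c in zip(r, v)] for r in rows]
    return mean, components


def pca_transform(x, mean, components):
    centered = [a - m for a, m in zip(x, mean)]
    return [dot(centered, c) for c in components]


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; an assumed stand-in for the paper's clustering step."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        centroids = [[sum(c) / len(g) for c in zip(*g)] if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids


def classify(x, labeled_centroids, k=3, eps=1e-9):
    """Distance-weighted vote among the k nearest centroids."""
    nearest = sorted(labeled_centroids, key=lambda lc: math.dist(x, lc[1]))[:k]
    votes = {}
    for label, c in nearest:
        votes[label] = votes.get(label, 0.0) + 1.0 / (math.dist(x, c) + eps)
    return max(votes, key=votes.get)


# Synthetic demo data (an assumption): "text" uses a small alphabet,
# "binary" uses uniformly random bytes.
rng = random.Random(42)
TEXT = b"abcdefghijklmnopqrstuvwxyz \n"


def fake_file(kind, size=512):
    if kind == "text":
        return bytes(rng.choices(TEXT, k=size))
    return bytes(rng.randrange(256) for _ in range(size))


train = [(kind, byte_frequency(fake_file(kind)))
         for kind in ("text", "binary") for _ in range(20)]
mean, comps = pca_fit([f for _, f in train], n_components=3)
reduced = [(kind, pca_transform(f, mean, comps)) for kind, f in train]
centroids = [(kind, c) for kind in ("text", "binary")
             for c in kmeans([p for lab, p in reduced if lab == kind], k=2)]


def predict(data):
    return classify(pca_transform(byte_frequency(data), mean, comps), centroids)


print(predict(fake_file("text")), predict(fake_file("binary")))
```

Classifying against a handful of centroids rather than every training sample is what keeps the per-query cost low as the training set grows. The inverse-distance weight lets a close centroid outvote several distant ones, which is one common way to soften the effect of unevenly distributed samples.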

CLC number: TP391.4

YAN Mengdi, QIN Linlin, WU Gang. File type detection algorithm based on principal component analysis and K nearest neighbors[J]. Journal of Computer Applications, 2016, 36(11): 3161-3164.

