Journal of Computer Applications ›› 2010, Vol. 30 ›› Issue (3): 799-801.

• Database and Data Mining •

Improvement of the density-based method for reducing training samples of the kNN classifier

熊忠阳,杨营辉,张玉芳   

  1. Chongqing University
  • Received: 2009-09-08  Revised: 2009-11-09  Online: 2010-03-14  Published: 2010-03-01
  • Corresponding author: 杨营辉
  • Supported by:
    China Postdoctoral Science Foundation; Natural Science Foundation Project of Chongqing Science and Technology Commission

Improvement of density-based method for reducing training data in KNN text classification

  • Received:2009-09-08 Revised:2009-11-09 Online:2010-03-14 Published:2010-03-01
  • Supported by:
    China Postdoctoral Science Foundation; Natural Science Foundation Project of Chongqing Science and Technology Commission

Abstract: In text classification, the distribution of the training set directly affects both the efficiency and the accuracy of the k-Nearest Neighbor (kNN) classifier. An analysis of the density-based method for reducing training samples of the kNN text classifier reveals two major shortcomings. First, the uniform state obtained after reduction is uniform only in the sense of spherical regions of radius ε, not in the ideal sense of equal pairwise distances between samples. Second, samples in low-density regions receive no treatment, so many non-uniform regions remain after reduction. Two improvements are proposed to address these shortcomings: the reduction strategy is optimized so that the reduced training set is closer to the ideal uniform state, and samples in low-density regions are supplemented. Comparative experiments show that the improved method clearly increases both stability and accuracy.

Keywords: text classification, k-Nearest Neighbor, fast classification, sample reduction, sample supplement

Abstract: The distribution of training data directly influences the efficiency and precision of the k-Nearest Neighbor (kNN) text classifier. An analysis of the density-based method for reducing the amount of training data in kNN text classification uncovered two disadvantages. First, the even distribution obtained after reduction is even only with respect to spherical regions of radius ε, rather than in the ideal sense of equal distances between every pair of training texts. Second, training texts in low-density regions receive no treatment, so plenty of low-density regions still exist in the training data after reduction. An improved approach to these deficiencies was proposed: the reduction strategy was optimized to bring the reduced training set closer to an even distribution, and appropriate samples were supplemented into the low-density regions. Experimental comparison shows that the improved method performs distinctly better in both stability and accuracy.
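To illustrate the kind of ε-ball reduction being analyzed above, the following is a minimal, hypothetical sketch (not the paper's actual algorithm): within each dense spherical region of radius ε, only one representative sample is kept, so near-duplicate training samples are pruned while isolated samples survive. The function name and the greedy keep-first strategy are illustrative assumptions.

```python
from math import dist

def prune_by_density(samples, eps):
    """Greedy sketch of density-based sample reduction (illustrative,
    not the method proposed in the paper): keep a sample only if no
    already-kept sample lies within distance eps of it, so each
    eps-ball contributes a single representative to the reduced set."""
    kept = []
    for x in samples:
        # discard x if it falls inside the eps-ball of a kept sample
        if all(dist(x, y) > eps for y in kept):
            kept.append(x)
    return kept

# Toy data: three near-duplicates in one dense region plus one outlier
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
reduced = prune_by_density(samples, eps=0.5)
print(reduced)  # [(0.0, 0.0), (5.0, 5.0)]
```

As the abstract notes, this kind of reduction leaves low-density regions untouched and only equalizes density per ε-ball; the paper's improvements target exactly those two gaps.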

Key words: text categorization, k-Nearest Neighbor (kNN), fast classification, sample reduction, sample supplement