Journal of Computer Applications ›› 2010, Vol. 30 ›› Issue (3): 799-801.

• Database and Data Mining •

Improvement of the density-based method for reducing training samples of the kNN classifier

熊忠阳,杨营辉,张玉芳   

  1. Chongqing University
  • Received: 2009-09-08  Revised: 2009-11-09  Online: 2010-03-14  Published: 2010-03-01
  • Corresponding author: 杨营辉
  • Supported by:
    China Postdoctoral Science Foundation; Natural Science Foundation Project of Chongqing Science and Technology Commission

Improvement of density-based method for reducing training data in KNN text classification

  • Received:2009-09-08 Revised:2009-11-09 Online:2010-03-14 Published:2010-03-01
  • Supported by:
    China Postdoctoral Science Foundation; Natural Science Foundation Project of Chongqing Science and Technology Commission

Abstract: In text classification, the distribution of the training set directly affects both the efficiency and the accuracy of the k-Nearest Neighbor (kNN) classifier. An analysis of the density-based method for reducing training samples of the kNN text classifier reveals two major shortcomings. First, the uniform state obtained after reduction is uniform only in the sense of spherical regions of radius ε, not in the ideal sense of equal pairwise distances between samples. Second, samples in low-density regions receive no treatment, so many non-uniform regions remain after reduction. Two improvements are proposed to address these shortcomings: the reduction strategy is optimized so that the reduced training set is closer to the ideal uniform state, and samples in low-density regions are supplemented. Comparative experiments show that the improved method clearly increases both stability and accuracy.

Keywords: text classification, k-Nearest Neighbor, fast classification, sample reduction, sample supplement

Abstract: The distribution of training data directly influences the efficiency and precision of the k-Nearest Neighbor (kNN) text classifier. An analysis of the density-based method for reducing the amount of training data in kNN text classification uncovered two disadvantages. First, the even distribution obtained after reduction is even only with respect to spherical regions of radius ε, rather than in the ideal sense of equal distances between every pair of training texts. Second, training texts in low-density regions receive no treatment, so plenty of low-density regions still exist in the training data after reduction. An improved approach to these deficiencies was proposed: the reduction strategy was optimized to bring the reduced training set closer to an even distribution, and appropriate samples were supplemented into the low-density regions. Experimental comparison shows that the improved method performs distinctly better in both stability and accuracy.
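To illustrate the kind of ε-ball reduction being analyzed above, the following is a minimal, hypothetical sketch (not the paper's actual algorithm): within each dense spherical region of radius ε, only one representative sample is kept, so near-duplicate training samples are pruned while isolated samples survive. The function name and the greedy keep-first strategy are illustrative assumptions.

```python
from math import dist

def prune_by_density(samples, eps):
    """Greedy sketch of density-based sample reduction (illustrative,
    not the method proposed in the paper): keep a sample only if no
    already-kept sample lies within distance eps of it, so each
    eps-ball contributes a single representative to the reduced set."""
    kept = []
    for x in samples:
        # discard x if it falls inside the eps-ball of a kept sample
        if all(dist(x, y) > eps for y in kept):
            kept.append(x)
    return kept

# Toy data: three near-duplicates in one dense region plus one outlier
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
reduced = prune_by_density(samples, eps=0.5)
print(reduced)  # [(0.0, 0.0), (5.0, 5.0)]
```

As the abstract notes, this kind of reduction leaves low-density regions untouched and only equalizes density per ε-ball; the paper's improvements target exactly those two gaps.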

Key words: text categorization, k-Nearest Neighbor (kNN), fast classification, sample reduction, sample supplement