Journal of Computer Applications (计算机应用) ›› 2009, Vol. 29 ›› Issue (07): 1755-1757.

• Multimedia and Software Technology •

Research on a new CDF feature extraction method for text categorization

Xiong Zhongyang (熊忠阳), Jiang Jian (蒋健), Zhang Yufang (张玉芳)

  1. Chongqing University
  • Received: 2009-01-08 Revised: 2009-02-27 Online: 2009-07-01 Published: 2009-07-01
  • Corresponding author: Jiang Jian (蒋健)
  • Supported by:

    Provincial and ministerial-level fund

New feature selection approach (CDF) in text categorization

  • Received: 2009-01-08 Revised: 2009-02-27 Online: 2009-07-01 Published: 2009-07-01

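Abstract (Chinese version):

Reducing the dimensionality of a high-dimensional feature set is an important step in text categorization. Building on a study of existing feature dimension reduction techniques, this paper briefly analyzes several commonly used feature extraction methods and then proposes a new feature extraction method, the CDF method, which combines inter-class concentration, intra-class dispersion, and intra-class average frequency. Experiments use the K-nearest neighbor (KNN) classification algorithm to examine the effectiveness of the CDF method. The results show that the method is simple and effective and achieves better dimension reduction than traditional feature extraction methods.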

Keywords: text categorization; dimension reduction; evaluation function

Abstract:

Reducing the high dimensionality of feature vectors is an essential step in text categorization. Building on a study of current dimension reduction techniques and an analysis of some commonly used feature selection methods, this paper proposes a new feature selection approach, named CDF, that jointly takes into account concentration among classes, distribution within a class, and average frequency within a class. Experiments use the K-nearest neighbor (KNN) classifier to evaluate the approach. The results indicate that CDF is simple and effective, and that it achieves better dimension reduction performance than conventional feature selection methods.
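The abstract names the three statistics that CDF combines but does not give the evaluation function itself. The following is only a minimal sketch of how such a score could be computed, assuming (hypothetically) that the three quantities are multiplied and that a term keeps its best per-class score. The function names `cdf_like_scores` and `select_features`, the definitions of each statistic, and the product form are all assumptions for illustration, not the formula from the paper.

```python
from collections import Counter, defaultdict


def cdf_like_scores(docs, labels):
    """Score terms with a hypothetical product of three per-class statistics.

    docs   : list of token lists, one per document
    labels : list of class labels, aligned with docs
    NOTE: the definitions below are illustrative assumptions, not the
    evaluation function from the paper.
    """
    n_docs = Counter(labels)        # number of documents in each class
    df = defaultdict(Counter)       # df[term][cls]: docs of cls containing term
    tf = defaultdict(Counter)       # tf[term][cls]: occurrences of term in cls
    for tokens, cls in zip(docs, labels):
        for term, count in Counter(tokens).items():
            df[term][cls] += 1
            tf[term][cls] += count

    scores = {}
    for term in df:
        total_df = sum(df[term].values())
        best = 0.0
        for cls, d in df[term].items():
            concentration = d / total_df           # share of the term's docs in cls
            dispersion = d / n_docs[cls]           # share of cls docs containing the term
            avg_freq = tf[term][cls] / n_docs[cls] # mean occurrences per cls document
            best = max(best, concentration * dispersion * avg_freq)
        scores[term] = best
    return scores


def select_features(docs, labels, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = cdf_like_scores(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus, invented purely for illustration.
docs = [
    ["ball", "goal", "team"],
    ["ball", "ball", "match"],
    ["stock", "market", "team"],
    ["stock", "stock", "bank"],
]
labels = ["sport", "sport", "finance", "finance"]
print(select_features(docs, labels, 2))
```

On this toy corpus, terms that occur often and only within one class ("ball", "stock") rank above a term spread evenly across both classes ("team"). The reduced vocabulary could then be used to build the document vectors passed to a KNN classifier, as in the evaluation the abstract describes.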
