Journal of Computer Applications (计算机应用) ›› 2005, Vol. 25 ›› Issue (09): 2031-2033. DOI: 10.3724/SP.J.1087.2005.02031

• Artificial Intelligence •

Feature selection based on word frequency differences and an improved TF-IDF formula

LUO Xin (罗欣), XIA De-lin (夏德麟), YAN Pu-liu (晏蒲柳)

  1. School of Electronics & Information, Wuhan University, Wuhan, Hubei 430079, China
  • Online: 2011-04-11 Published: 2005-09-01

Abstract: The quality of document vectorization strongly affects the speed and accuracy of text categorization. Three formulas commonly used in document vectorization are analyzed: TF-IDF, mutual information (MI), and information gain (IG). A feature selection method based on word frequency differences and an improved TF-IDF formula are proposed to improve the quality of feature selection and the speed and accuracy of text categorization.

Key words: feature selection, vector space model (VSM), text categorization, TF-IDF, information gain (IG), mutual information (MI)
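
For orientation, the classical TF-IDF weighting that the abstract refers to can be sketched as follows. This is the textbook form, tf(t, d) · log(N / df(t)), and not the improved formula proposed in the paper, which the abstract does not state. The function name `tf_idf` and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Textbook weighting: tf(t, d) * log(N / df(t)).
    This is the classical baseline, NOT the paper's improved formula.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, already tokenized (an assumption for illustration).
docs = [
    ["text", "classification", "feature"],
    ["feature", "selection", "feature"],
    ["text", "vector", "model"],
]
weights = tf_idf(docs)
```

Under this baseline a term that occurs in every document receives weight zero, regardless of how its frequency differs across categories.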
