Journal of Computer Applications ›› 2009, Vol. 29 ›› Issue (08): 2268-2271.

• Artificial Intelligence •

Feature selection combining new document frequency with binary discernibility matrix

MA Chunhua1, ZHU Haodong2, ZHONG Yong3

  1. Suihua University
  2. Chengdu Institute of Computer Applications, Chinese Academy of Sciences
  3. Chengdu Institute of Computer Applications, Chinese Academy of Sciences
  • Received: 2009-03-23 Revised: 2009-05-14 Online: 2008-08-01 Published: 2009-08-01
  • Corresponding author: MA Chunhua
  • Supported by:
    Provincial and ministerial-level fund

Abstract: Feature selection is a core research topic in text categorization. Several classic feature selection methods are analyzed and their deficiencies summarized. A new document frequency is proposed, Rough Set (RS) theory is introduced, and an attribute reduction algorithm based on the binary discernibility matrix is given. Combining this attribute reduction algorithm with the new document frequency yields a comprehensive feature selection method. The method first uses the new document frequency for preliminary feature selection, filtering out some terms, and then applies the proposed attribute reduction algorithm to eliminate redundancy. Classification experiments on 8 news categories from People's Daily Online (http://www.people.com.cn), with 300 documents per category, show that the method achieves higher classification accuracy and recall than the Mutual Information (MI), CHI, and Information Gain (IG) methods.

Keywords: feature selection, text categorization, document frequency, binary discernibility matrix, Rough Set (RS), attribute reduction
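
The two-stage pipeline described in the abstract (frequency-based pre-filtering, then redundancy removal over a binary discernibility matrix) can be illustrated with a minimal sketch. It rests on several assumptions: the abstract does not define the new document frequency, so the classic document frequency with a hypothetical `min_df` threshold stands in for it; the reduction step is a generic greedy covering heuristic, not necessarily the authors' algorithm; and all function names and the toy corpus are illustrative only.

```python
from itertools import combinations


def document_frequency_filter(docs, min_df):
    """Stage 1 (stand-in): keep terms found in at least `min_df` documents.

    Uses the classic document frequency, not the paper's new variant.
    """
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return sorted(t for t, n in df.items() if n >= min_df)


def binary_discernibility_matrix(rows, labels):
    """Build one bit-row per pair of documents with different class labels.

    Bit k is 1 when the two documents differ on feature k. All-zero rows
    (pairs no feature can tell apart) are dropped.
    """
    matrix = []
    for i, j in combinations(range(len(rows)), 2):
        if labels[i] != labels[j]:
            bits = [int(a != b) for a, b in zip(rows[i], rows[j])]
            if any(bits):
                matrix.append(bits)
    return matrix


def greedy_reduct(matrix, n_features):
    """Stage 2 (assumed heuristic): pick columns until every row is covered."""
    selected = []
    uncovered = list(matrix)
    while uncovered:
        # Count how many still-uncovered pairs each feature discerns.
        counts = [sum(r[k] for r in uncovered) for k in range(n_features)]
        best = max(range(n_features), key=counts.__getitem__)
        selected.append(best)
        uncovered = [r for r in uncovered if not r[best]]
    return sorted(selected)


def select_features(docs, labels, min_df=2):
    """Run both stages and return the surviving terms."""
    terms = document_frequency_filter(docs, min_df)
    rows = [[int(t in set(d)) for t in terms] for d in docs]
    matrix = binary_discernibility_matrix(rows, labels)
    return [terms[k] for k in greedy_reduct(matrix, len(terms))]


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    docs = [
        ["goal", "match", "team"],
        ["goal", "team", "coach"],
        ["stock", "market", "team"],
        ["stock", "market", "bank"],
    ]
    labels = ["sports", "sports", "finance", "finance"]
    print(select_features(docs, labels, min_df=2))
```

On this toy corpus the first stage drops terms that appear in only one document, and the second stage keeps a single term because it alone already separates every pair of documents from different classes; the remaining candidates are redundant in the rough-set sense.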
