Journal of Computer Applications ›› 2010, Vol. 30 ›› Issue (10): 2621-2623.

• Database and Data Mining •

Improved feature selection approach combined with semantics

XIONG Zhongyang (熊忠阳)1, FU Lingling (付玲玲)2, ZHANG Yufang (张玉芳)1, JIANG Jian (蒋健)1

  1. Chongqing University
  2. College of Computer Science, Chongqing University
  • Received: 2010-04-06 Revised: 2010-05-24 Online: 2010-09-21 Published: 2010-10-01
  • Corresponding author: FU Lingling (付玲玲; Fulynn)
  • Supported by:
    China Postdoctoral Science Foundation; Chongqing Science and Technology Commission Foundation

Abstract: Traditional feature selection methods for text categorization rely on word-frequency statistics and ignore the semantic information carried by the feature terms themselves. Redundancy among feature terms also means that a feature space of limited dimension cannot hold more of the terms that are useful for classification. To address this, a "concept-domain" table is built from the Chinese-English bilingual knowledge dictionary of HowNet. Each word in a text is looked up in the table: if it is present, it is mapped to its domain; otherwise the original word is kept. This generalizes lower-level concepts to higher-level ones, removes some of the redundancy among feature terms, and semantically strengthens a term's contribution to the classification of its domain. Text categorization experiments that apply the method with information gain and with the χ² statistic show that it effectively improves the precision of classification.

Key words: text categorization, feature selection, semantics, HowNet
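The lookup-and-replace step described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `CONCEPT_DOMAIN` entries are invented and are not taken from HowNet, the helper names are hypothetical, and `chi_square` is the standard document-count form of the χ² statistic rather than the exact variant used in the paper.

```python
# Hypothetical "concept-domain" table; entries are illustrative, not from HowNet.
CONCEPT_DOMAIN = {
    "striker": "sport",
    "goalkeeper": "sport",
    "dividend": "finance",
    "shareholder": "finance",
}


def map_to_domain(tokens, table=CONCEPT_DOMAIN):
    """Replace each token found in the table by its domain; keep other tokens."""
    return [table.get(tok, tok) for tok in tokens]


def vocabulary_size(docs):
    """Number of distinct terms across all documents."""
    return len({tok for doc in docs for tok in doc})


def chi_square(docs, labels, term, category):
    """Standard chi-square score of `term` for `category`, from document counts."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document in another category
        elif in_cat:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


# Toy corpus: two "sport" and two "finance" documents, already tokenized.
docs = [
    ["striker", "scores"],
    ["goalkeeper", "saves"],
    ["dividend", "paid"],
    ["shareholder", "votes"],
]
labels = ["sport", "sport", "finance", "finance"]
mapped = [map_to_domain(doc) for doc in docs]

score_before = chi_square(docs, labels, "striker", "sport")   # about 1.33
score_after = chi_square(mapped, labels, "sport", "sport")    # 4.0
```

On this toy corpus the mapping shrinks the vocabulary from 8 terms to 6, and the merged feature `sport` scores higher than the original word `striker`, which is the kind of redundancy reduction and strengthened domain contribution the abstract describes. The numbers come from the invented data only and say nothing about the paper's experimental results.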
