Journal of Computer Applications (计算机应用)

• Artificial Intelligence

Hierarchical categorization methods of Chinese text based on vector space model

Xiao Xue (肖雪), He Zhongshi (何中市)

  1. College of Computer Science, Chongqing University (重庆大学计算机学院)
  • Received: 2005-11-03 Revised: 2006-01-13 Online: 2006-05-01 Published: 2006-05-01
  • Corresponding author: Xiao Xue

Abstract: When the number of categories in a text categorization task is large, hierarchical categorization is an effective approach. Given the structure of a category hierarchy, and considering that different levels place different demands on feature selection and on the categorization method, this paper proposes a new feature selection method based on the vector space model, Feature Dual-Selection (FDS), and a Hierarchical Text Categorization (HTC) algorithm. FDS performs feature selection once at each level and changes the number of features and the term weighting method from level to level. HTC combines the class-center vector method, which is more effective for coarse classification, with the Support Vector Machine (SVM), which is more effective for fine classification. Experiments show that the approach achieves higher accuracy than flat categorization and than general hierarchical categorization methods.

Key words: hierarchical categorization, vector space model, Feature Dual-Selection (FDS), term weighting
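
The two-stage scheme summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the feature scoring criterion, the per-level weighting formulas, or the SVM configuration. The document-frequency-gap score in `select_features`, the choice of binary weights at the top level and normalized term frequency at the fine level, the Pegasos-style linear SVM in `train_svm`, the parameters `k_top` and `k_fine`, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def select_features(docs, labels, k):
    """Keep the k terms whose per-class document-frequency rates differ most.
    Assumed criterion for illustration; the abstract does not specify one."""
    classes = sorted(set(labels))
    n_docs = Counter(labels)
    df = {c: Counter() for c in classes}
    for tokens, c in zip(docs, labels):
        df[c].update(set(tokens))
    vocab = set().union(*df.values())

    def score(term):
        rates = [df[c][term] / n_docs[c] for c in classes]
        return max(rates) - min(rates)

    return set(sorted(vocab, key=lambda t: (-score(t), t))[:k])


def vectorize(tokens, vocab, binary=False):
    """L2-normalized sparse vector; binary or term-frequency weights (assumed)."""
    counts = Counter(t for t in tokens if t in vocab)
    weights = {t: (1.0 if binary else float(n)) for t, n in counts.items()}
    norm = math.sqrt(sum(x * x for x in weights.values())) or 1.0
    return {t: x / norm for t, x in weights.items()}


def dot(u, v):
    return sum(x * v.get(t, 0.0) for t, x in u.items())


def train_centroids(vectors, labels):
    """Class-center vectors: normalized sum of each class's document vectors."""
    sums = {}
    for vec, c in zip(vectors, labels):
        sums.setdefault(c, Counter()).update(vec)
    centroids = {}
    for c, acc in sums.items():
        norm = math.sqrt(sum(x * x for x in acc.values())) or 1.0
        centroids[c] = {t: x / norm for t, x in acc.items()}
    return centroids


def train_svm(vectors, labels, epochs=30, lam=0.01):
    """One-vs-rest linear SVM trained by Pegasos-style subgradient steps on the
    hinge loss, cycling through the data in order (a stand-in for a real SVM)."""
    models = {}
    for c in sorted(set(labels)):
        w, step = {}, 0
        for _ in range(epochs):
            for vec, lab in zip(vectors, labels):
                step += 1
                eta = 1.0 / (lam * step)
                y = 1.0 if lab == c else -1.0
                violated = y * dot(vec, w) < 1.0
                w = {t: x * (1.0 - eta * lam) for t, x in w.items()}
                if violated:
                    for t, x in vec.items():
                        w[t] = w.get(t, 0.0) + eta * y * x
        models[c] = w
    return models


class HierarchicalClassifier:
    """Coarse level: class centroids. Fine level: SVM per parent category.
    Features are re-selected at each level with a different size and weighting."""

    def __init__(self, k_top=4, k_fine=6):
        self.k_top, self.k_fine = k_top, k_fine

    def fit(self, docs, paths):
        parents = [p for p, _ in paths]
        self.top_vocab = select_features(docs, parents, self.k_top)
        top_vecs = [vectorize(d, self.top_vocab, binary=True) for d in docs]
        self.centroids = train_centroids(top_vecs, parents)
        self.fine = {}
        for parent in sorted(set(parents)):
            sub_docs = [d for d, (p, _) in zip(docs, paths) if p == parent]
            sub_labels = [c for p, c in paths if p == parent]
            vocab = select_features(sub_docs, sub_labels, self.k_fine)
            vecs = [vectorize(d, vocab) for d in sub_docs]
            self.fine[parent] = (vocab, train_svm(vecs, sub_labels))
        return self

    def predict(self, tokens):
        top = vectorize(tokens, self.top_vocab, binary=True)
        parent = max(sorted(self.centroids), key=lambda c: dot(top, self.centroids[c]))
        vocab, models = self.fine[parent]
        vec = vectorize(tokens, vocab)
        child = max(sorted(models), key=lambda c: dot(vec, models[c]))
        return parent, child


# Toy, pre-tokenized corpus (hypothetical data, not from the paper).
DOCS = [
    ["match", "goal", "team", "league"], ["goal", "striker", "team", "match"],
    ["match", "serve", "racket", "set"], ["serve", "racket", "match", "court"],
    ["computer", "chip", "board", "cpu"], ["chip", "cpu", "computer", "memory"],
    ["computer", "code", "compiler", "bug"], ["code", "bug", "computer", "program"],
]
PATHS = [("sports", "football")] * 2 + [("sports", "tennis")] * 2 \
      + [("tech", "hardware")] * 2 + [("tech", "software")] * 2

clf = HierarchicalClassifier().fit(DOCS, PATHS)
print(clf.predict(["goal", "team", "match"]))
print(clf.predict(["computer", "code", "bug"]))
```

In this sketch the top level sees only a few binary-weighted features that separate the broad categories, while each parent category gets its own larger, term-frequency-weighted feature set for its SVM. A real system for Chinese text would also need a word segmentation step before tokens reach `fit` and `predict`, which is omitted here.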