Journal of Computer Applications ›› 2010, Vol. 30 ›› Issue (4): 1015-1018.

• Artificial Intelligence •

Semi-supervised short text classification algorithm based on attribute selection

Cai Yuehong1, Zhu Qian2, Sun Ping2, Cheng Xianyi2

  1. Jiangsu University
    2.
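  • Received: 2009-10-12 Revised: 2009-12-02 Online: 2010-04-15 Published: 2010-04-01
  • Corresponding author: Cai Yuehong
  • Supported by:
    Research on constrained learning algorithms based on particle swarm optimization and prior information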

Semi-supervised short text categorization based on attribute selection

  • Received: 2009-10-12 Revised: 2009-12-02 Online: 2010-04-15 Published: 2010-04-01
  • Contact: Cai YueHong

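Abstract: To address the scarcity of labeled corpora in large-scale short text classification, a semi-supervised short text classification algorithm based on attribute selection is proposed. An attribute selection technique based on ReliefF evaluation and an independence measure selects a subset of attributes that are approximately independent of one another to take part in learning the classification model, which weakens the strong independence assumption of the naive Bayes model. With ensemble learning, a group of classifiers with a certain degree of diversity estimates the initial values and classifies the unlabeled corpus by majority voting, which reduces the sensitivity of the Expectation-Maximization (EM) algorithm to its initial values. Comparative experiments on a real corpus show that the method can effectively use a large amount of unlabeled data to improve the generalization ability of the algorithm.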

Key words: attribute selection, semi-supervised learning, short text, text classification, ensemble learning
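
The attribute selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes binary (0/1) attributes, substitutes basic Relief (one nearest hit and one nearest miss per instance) for the full ReliefF evaluation, and uses pairwise mutual information as the independence measure, which the abstract does not specify. All function names and thresholds (`relief_weights`, `mutual_information`, `select_attributes`, `min_relevance`, `max_dependence`) are hypothetical.

```python
import math
from collections import Counter


def relief_weights(X, y):
    """Basic Relief relevance weights for binary attributes (a simplification of ReliefF)."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d
    for i in range(n):
        def distance(j):
            # Hamming distance between instance i and instance j
            return sum(a != b for a, b in zip(X[i], X[j]))
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue
        hit, miss = min(hits, key=distance), min(misses, key=distance)
        for f in range(d):
            # Reward attributes that differ on the nearest miss, penalize those that differ on the nearest hit
            weights[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n
    return weights


def mutual_information(X, a, b):
    """Empirical mutual information (in nats) between attributes a and b; 0 means independent."""
    n = len(X)
    joint = Counter((row[a], row[b]) for row in X)
    count_a = Counter(row[a] for row in X)
    count_b = Counter(row[b] for row in X)
    return sum((c / n) * math.log(c * n / (count_a[u] * count_b[v]))
               for (u, v), c in joint.items())


def select_attributes(X, y, min_relevance=0.0, max_dependence=0.1):
    """Keep relevant attributes that are nearly independent of those already kept."""
    weights = relief_weights(X, y)
    ranked = sorted(range(len(weights)), key=lambda f: -weights[f])
    selected = []
    for f in ranked:
        if weights[f] <= min_relevance:
            continue  # irrelevant attribute
        if all(mutual_information(X, f, g) <= max_dependence for g in selected):
            selected.append(f)  # not redundant with any kept attribute
    return selected


# Toy data: attribute 1 duplicates attribute 0, attribute 2 is constant.
X = [(0, 0, 0, 0), (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 1),
     (1, 1, 0, 0), (1, 1, 0, 0), (1, 1, 0, 1), (1, 1, 0, 1)]
y = [0, 0, 1, 1, 1, 1, 1, 1]
selected = select_attributes(X, y)
print(selected)  # [3, 0]
```

On this toy data the duplicate attribute 1 is dropped as redundant and the constant attribute 2 as irrelevant, so only attributes that are both informative and mutually independent reach the naive Bayes learner.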

Abstract: To address the scarcity of labeled data in massive short text categorization, a semi-supervised short text categorization method based on attribute selection is presented. An attribute selection algorithm based on ReliefF and independence measures is used to relax the attribute independence assumption by deleting irrelevant or redundant attributes, and an ensemble algorithm based on Expectation-Maximization (EM) is used to reduce the sensitivity of the semi-supervised EM algorithm to its initial values. Experiments on a real corpus show that the proposed method can use unlabeled examples more effectively and stably to improve classification generalization.

Key words: attribute selection, semi-supervised learning, short text, text categorization, ensemble learning
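
The ensemble-initialized EM procedure mentioned in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' method: it uses a Bernoulli naive Bayes model over binary attributes, obtains classifier diversity by bootstrap resampling of the labeled set (the abstract only says the classifiers differ to some degree), and runs a fixed number of EM iterations. All names (`train_nb`, `posterior`, `predict`, `ensemble_init`, `semi_supervised_em`) are hypothetical.

```python
import math
import random
from collections import Counter


def train_nb(X, W, n_classes):
    """Fit Bernoulli naive Bayes from soft labels; W[i][c] is the weight of class c for row i."""
    d = len(X[0])
    class_w = [sum(w[c] for w in W) for c in range(n_classes)]
    total = sum(class_w)
    prior = [(class_w[c] + 1.0) / (total + n_classes) for c in range(n_classes)]
    # Laplace-smoothed estimate of P(attribute = 1 | class)
    cond = [[(sum(w[c] * x[f] for x, w in zip(X, W)) + 1.0) / (class_w[c] + 2.0)
             for f in range(d)] for c in range(n_classes)]
    return prior, cond


def posterior(model, x):
    """Class posterior P(c | x), computed in log space for numerical stability."""
    prior, cond = model
    logp = [math.log(p) + sum(math.log(q if v else 1.0 - q) for v, q in zip(x, cq))
            for p, cq in zip(prior, cond)]
    top = max(logp)
    scores = [math.exp(l - top) for l in logp]
    z = sum(scores)
    return [s / z for s in scores]


def predict(model, x):
    post = posterior(model, x)
    return max(range(len(post)), key=post.__getitem__)


def one_hot(label, n_classes):
    return [1.0 if c == label else 0.0 for c in range(n_classes)]


def ensemble_init(X_lab, y_lab, X_unl, n_classes, n_members=5, seed=0):
    """Label the unlabeled rows by majority vote of bootstrap-trained classifiers."""
    rng = random.Random(seed)
    n = len(X_lab)
    members = []
    for _ in range(n_members):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (assumed source of diversity)
        members.append(train_nb([X_lab[i] for i in idx],
                                [one_hot(y_lab[i], n_classes) for i in idx], n_classes))
    return [Counter(predict(m, x) for m in members).most_common(1)[0][0] for x in X_unl]


def semi_supervised_em(X_lab, y_lab, X_unl, n_classes, n_iter=10):
    """EM over labeled and unlabeled rows, started from the ensemble's majority-vote labels."""
    W_lab = [one_hot(c, n_classes) for c in y_lab]
    W_unl = [one_hot(c, n_classes) for c in ensemble_init(X_lab, y_lab, X_unl, n_classes)]
    for _ in range(n_iter):
        model = train_nb(X_lab + X_unl, W_lab + W_unl, n_classes)  # M-step
        W_unl = [posterior(model, x) for x in X_unl]               # E-step
    return model


# Toy data: class 0 rows look like (1, 1, 0), class 1 rows like (0, 0, 1).
X_lab = [(1, 1, 0), (1, 1, 0), (0, 0, 1), (0, 0, 1)]
y_lab = [0, 0, 1, 1]
X_unl = [(1, 1, 0)] * 4 + [(0, 0, 1)] * 4
initial_labels = ensemble_init(X_lab, y_lab, X_unl, 2)
model = semi_supervised_em(X_lab, y_lab, X_unl, 2)
print(predict(model, (1, 1, 0)), predict(model, (0, 0, 1)))
```

Starting EM from the majority-vote labels rather than from a single classifier is the design choice the abstract credits with reducing sensitivity to initial values; the labeled rows keep fixed one-hot weights throughout, so they anchor the class identities.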