Journal of Computer Applications ›› 2012, Vol. 32 ›› Issue (02): 403-410. DOI: 10.3724/SP.J.1087.2012.00403

• Artificial Intelligence •

Label propagation algorithm based on LDA model

LIU Pei-qi, SUN Jie-han

  1. School of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi'an, Shaanxi 710055, China
  • Received: 2011-08-05 Revised: 2011-09-21 Online: 2012-02-23 Published: 2012-02-01
  • Corresponding author: LIU Pei-qi
  • About the authors: LIU Pei-qi (born 1959), male, from Xi'an, Shaanxi, associate professor, Ph.D.; main research interests: machine learning, data mining, and natural language processing;
    SUN Jie-han (born 1988), female, from Jinan, Shandong, master's student; main research interests: machine learning and data mining.

Abstract: The Label Propagation (LP) algorithm is a semi-supervised classification method. It performs poorly in text classification, however, because it requires the classification results to satisfy the manifold assumption and because computing similarities is expensive when the data are high-dimensional. To address these problems, the principles and complexities of the Latent Dirichlet Allocation (LDA) topic model and the LP algorithm were analyzed, and the two were combined into a new method, LPLDA. LPLDA represents each document by its latent LDA topics. On the one hand, this representation helps ensure that the classification results satisfy the manifold assumption; on the other hand, it reduces the dimension of the matrices involved and thus the time LP spends computing similarities. The experimental results show that LPLDA classifies better than traditional supervised text classification methods when there are fewer labeled data than samples to be classified.

Key words: Latent Dirichlet Allocation (LDA) model, Label Propagation (LP) algorithm, semi-supervised learning, dimensionality reduction, manifold assumption
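
The idea summarized in the abstract, propagating labels over documents that are represented by topic proportions rather than by word counts, can be illustrated with a minimal sketch. It is not the authors' LPLDA implementation: the abstract does not give their exact formulation, so the sketch assumes a common row-normalized label propagation variant with a Gaussian kernel. The function name `label_propagation`, the parameters `sigma` and `n_iter`, and the hand-written topic vectors in `theta` (standing in for what an LDA model would infer) are all hypothetical.

```python
import math


def label_propagation(topic_vectors, labels, n_classes, sigma=0.3, n_iter=200):
    """Propagate labels over a Gaussian-kernel graph built on topic vectors.

    labels[i] is a class index for a labeled document and None otherwise.
    Returns the predicted class index of every document.
    """
    n = len(topic_vectors)

    # Squared Euclidean distance in topic space: O(K) per pair,
    # where K is the number of topics, instead of O(vocabulary size).
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    weights = [
        [math.exp(-sq_dist(topic_vectors[i], topic_vectors[j]) / sigma ** 2)
         for j in range(n)]
        for i in range(n)
    ]
    # Row-normalize the similarities into transition probabilities.
    trans = [[w / sum(row) for w in row] for row in weights]

    # One-hot distribution for labeled documents, uniform for unlabeled ones.
    def init_row(lab):
        if lab is None:
            return [1.0 / n_classes] * n_classes
        return [1.0 if c == lab else 0.0 for c in range(n_classes)]

    dist = [init_row(lab) for lab in labels]

    for _ in range(n_iter):
        # Each document takes the weighted average of its neighbours' labels.
        new = [
            [sum(trans[i][j] * dist[j][c] for j in range(n))
             for c in range(n_classes)]
            for i in range(n)
        ]
        # Clamp labeled documents back to their known labels.
        dist = [
            init_row(labels[i]) if labels[i] is not None else new[i]
            for i in range(n)
        ]

    return [max(range(n_classes), key=lambda c: row[c]) for row in dist]


# Hypothetical document-topic proportions over K = 3 topics (stand-in for LDA output).
theta = [
    [0.80, 0.10, 0.10],  # labeled as class 0
    [0.70, 0.20, 0.10],
    [0.75, 0.15, 0.10],
    [0.10, 0.10, 0.80],  # labeled as class 1
    [0.20, 0.10, 0.70],
    [0.15, 0.05, 0.80],
]
labels = [0, None, None, 1, None, None]
predicted = label_propagation(theta, labels, n_classes=2)
print(predicted)
```

With only one labeled document per class, the four unlabeled documents in this toy example take the label of the nearby labeled document in topic space. Because each pairwise similarity is computed on K topic proportions rather than on a vocabulary-sized vector, the cost of building the similarity matrix scales with the number of topics, which is the saving the abstract attributes to the LDA representation.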

CLC number: