Journal of Computer Applications (计算机应用) ›› 2013, Vol. 33 ›› Issue (06): 1587-1590. DOI: 10.3724/SP.J.1087.2013.01587

• Artificial Intelligence •

Short text classification method based on the LDA topic model

张志飞,苗夺谦,高灿   

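  1. Department of Computer Science and Technology, Tongji University, Shanghai 201804
  • Received: 2012-12-14  Revised: 2013-01-24  Published: 2013-06-01  Online: 2013-06-05
  • Corresponding author: 张志飞 (ZHANG Zhifei)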
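  • About the authors: ZHANG Zhifei (born 1986), male, from Rudong, Jiangsu, Ph.D. candidate, CCF student member; main research interests: granular computing, text mining. MIAO Duoqian (born 1964), male, from Qixian, Shanxi, professor, doctoral supervisor, CCF senior member; main research interests: rough sets, Web intelligence, machine learning. GAO Can (born 1983), male, from Nanxian, Hunan, Ph.D. candidate, CCF student member; main research interests: rough sets, machine learning.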
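  • Funding:

    National Natural Science Foundation of China (60970061); National Natural Science Foundation of China (61075056); National Natural Science Foundation of China (61103067); Fundamental Research Funds for the Central Universities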

Short text classification using latent Dirichlet allocation

ZHANG Zhifei,MIAO Duoqian,GAO Can   

  1. Department of Computer Science and Technology, Tongji University, Shanghai 201804, China
  • Received: 2012-12-14  Revised: 2013-01-24  Online: 2013-06-05  Published: 2013-06-01
  • Contact: ZHANG Zhifei

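Abstract: To address two problems of short texts, feature sparsity and context dependency, a short text classification method based on the latent Dirichlet allocation (LDA) model is proposed. Using the topics generated by the model, the method on the one hand distinguishes the contexts of identical words and lowers their weights, and on the other hand links different words to reduce sparsity and raises their weights. Page titles automatically crawled from NetEase are classified with the K-nearest neighbor method. Experiments show that the new method improves classification performance over the traditional vector space model and a topic-based similarity measure by about 5% and 2.5%, respectively.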

Keywords: short text, classification, K-nearest neighbor, similarity, latent Dirichlet allocation

Abstract: To address two key problems in short text classification, very sparse features and strong context dependency, a new method based on latent Dirichlet allocation (LDA) is proposed. The generated topics are used in two ways: they discriminate between the contexts of identical words, decreasing their weights, and they reduce sparsity by connecting different words, increasing their weights. A short text dataset was constructed by crawling the titles of Netease pages, and these titles were classified using K-nearest neighbors. The proposed method outperforms the traditional vector space model and a topic-based similarity measure by about 5% and 2.5%, respectively.

Key words: short text, classification, K-Nearest Neighbor (K-NN), similarity measure, latent Dirichlet allocation
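
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' formula: the abstract does not give the weighting scheme, so the two weights below (`diff_topic_penalty`, `same_topic_bonus`), the function names, and the toy data are all assumptions made for illustration. Each word is assumed to already carry a topic id, as would be produced by a trained LDA model.

```python
from collections import Counter

def directed_match(doc_a, doc_b, same_topic_bonus, diff_topic_penalty):
    """Sum, over words in doc_a, of the best match weight found in doc_b."""
    total = 0.0
    for word_a, topic_a in doc_a:
        best = 0.0
        for word_b, topic_b in doc_b:
            if word_a == word_b:
                # Same word: full weight only if the topic (context) agrees.
                weight = 1.0 if topic_a == topic_b else diff_topic_penalty
            elif topic_a == topic_b:
                # Different words linked by a shared topic: partial credit.
                weight = same_topic_bonus
            else:
                weight = 0.0
            best = max(best, weight)
        total += best
    return total

def topic_aware_similarity(doc_a, doc_b, same_topic_bonus=0.5, diff_topic_penalty=0.5):
    """Symmetric similarity in [0, 1] between two lists of (word, topic_id) pairs.

    The 0.5 defaults are hypothetical, not values from the paper.
    """
    if not doc_a or not doc_b:
        return 0.0
    forward = directed_match(doc_a, doc_b, same_topic_bonus, diff_topic_penalty)
    backward = directed_match(doc_b, doc_a, same_topic_bonus, diff_topic_penalty)
    return (forward + backward) / (len(doc_a) + len(doc_b))

def knn_classify(query, labeled_docs, k=3):
    """Majority vote among the k labeled documents most similar to the query."""
    ranked = sorted(
        labeled_docs,
        key=lambda item: topic_aware_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): topic 0 ~ technology, topic 1 ~ food.
train = [
    ([("apple", 0), ("phone", 0), ("release", 0)], "tech"),
    ([("apple", 1), ("juice", 1), ("recipe", 1)], "food"),
    ([("tablet", 0), ("launch", 0)], "tech"),
    ([("banana", 1), ("smoothie", 1)], "food"),
]
query = [("apple", 0), ("laptop", 0)]
predicted = knn_classify(query, train, k=3)
print(predicted)
```

In this toy setup the word "apple" in a food context counts for less than "apple" in a technology context, while "laptop" and "tablet" still contribute to the similarity through their shared topic even though the words differ.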
