Short text classification using latent Dirichlet allocation

Journal of Computer Applications, 2013, Vol. 33, Issue (06): 1587-1590. DOI: 10.3724/SP.J.1087.2013.01587

ZHANG Zhifei, MIAO Duoqian, GAO Can

Department of Computer Science and Technology, Tongji University, Shanghai 201804, China

Received: 2012-12-14 Revised: 2013-01-24 Online: 2013-06-05 Published: 2013-06-01
Contact: ZHANG Zhifei
About the authors: ZHANG Zhifei (born 1986), male, from Rudong, Jiangsu, Ph.D. candidate, CCF student member; research interests: granular computing, text mining. MIAO Duoqian (born 1964), male, from Qixian, Shanxi, professor, doctoral supervisor, CCF senior member; research interests: rough sets, Web intelligence, machine learning. GAO Can (born 1983), male, from Nanxian, Hunan, Ph.D. candidate, CCF student member; research interests: rough sets, machine learning.
Supported by: National Natural Science Foundation of China (60970061, 61075056, 61103067); Fundamental Research Funds for the Central Universities.

Abstract

To address two key problems in short text classification, very sparse features and strong context dependency, a new method based on latent Dirichlet allocation (LDA) is proposed. The topics generated by the model serve two purposes: they distinguish the contexts of identical words and decrease their weights, and they connect different words to reduce sparsity and increase their weights. A short text dataset was constructed by automatically crawling the titles of NetEase pages, and these titles were classified using K-nearest neighbors (K-NN). Experiments show that the new method improves classification performance over the traditional vector space model and over topic-based similarity by about 5% and 2.5%, respectively.

Key words: short text, classification, K-nearest neighbor (K-NN), similarity measure, latent Dirichlet allocation
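The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' formulation: the abstract does not give the weighting formula, so the `penalty` and `bonus` values, the max-match normalization, the function names, and the toy data are all assumptions. Each token is paired with a topic id, which here is written by hand as a stand-in for the per-token topic assignments that a trained LDA model would provide.

```python
from collections import Counter

def pair_weight(x, y, penalty=0.5, bonus=0.5):
    """Weight for a pair of (word, topic) tokens.

    penalty and bonus are assumed values, not taken from the paper.
    """
    (word_x, topic_x), (word_y, topic_y) = x, y
    if word_x == word_y:
        # Same word: full weight only when the contexts (topics) agree.
        return 1.0 if topic_x == topic_y else penalty
    # Different words: a shared topic still contributes, easing sparsity.
    return bonus if topic_x == topic_y else 0.0

def similarity(a, b):
    """Symmetric best-match similarity between two token lists, in [0, 1]."""
    if not a or not b:
        return 0.0
    forward = sum(max(pair_weight(x, y) for y in b) for x in a)
    backward = sum(max(pair_weight(y, x) for x in a) for y in b)
    return (forward + backward) / (len(a) + len(b))

def knn_classify(query, train, k=3):
    """Majority vote among the k training texts most similar to query."""
    ranked = sorted(train, key=lambda item: similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical titles; topic ids 0 and 2 are hand-assigned for illustration.
TRAIN = [
    ([("apple", 0), ("iphone", 0), ("release", 0)], "tech"),
    ([("phone", 0), ("chip", 0)], "tech"),
    ([("apple", 2), ("juice", 2), ("recipe", 2)], "food"),
    ([("fruit", 2), ("salad", 2)], "food"),
]
QUERY = [("apple", 0), ("phone", 0)]

if __name__ == "__main__":
    print(knn_classify(QUERY, TRAIN, k=3))
```

In this toy setup the word "apple" appears in both a technology title and a food title, but its topic differs, so the match is down-weighted. "phone" and "chip" never co-occur as identical words, yet they share a topic and still contribute to the similarity.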

CLC number: TP18

ZHANG Zhifei, MIAO Duoqian, GAO Can. Short text classification using latent Dirichlet allocation[J]. Journal of Computer Applications, 2013, 33(06): 1587-1590.

