Label propagation algorithm based on LDA model

Journal of Computer Applications, 2012, Vol. 32, Issue (02): 403-410. DOI: 10.3724/SP.J.1087.2012.00403

LIU Pei-qi, SUN Jie-han

School of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi'an, Shaanxi 710055, China

Received: 2011-08-05 Revised: 2011-09-21 Online: 2012-02-23 Published: 2012-02-01
Corresponding author: LIU Pei-qi
About the authors: LIU Pei-qi (born 1959), male, from Xi'an, Shaanxi; associate professor, Ph.D.; research interests: machine learning, data mining, natural language processing.
SUN Jie-han (born 1988), female, from Jinan, Shandong; master's student; research interests: machine learning, data mining.

Abstract

The Label Propagation (LP) algorithm is a semi-supervised classification method. It performs poorly in text classification for two reasons: it requires the classification results to satisfy the manifold assumption, and its computational complexity is high when the data are high-dimensional. To address these problems, the principles and complexity of the Latent Dirichlet Allocation (LDA) topic model and of the LP algorithm were analyzed, and the two were combined into a label propagation algorithm based on the LDA topic model, named LPLDA. The algorithm represents each document by its LDA topics. On the one hand, this representation ensures that the classification results satisfy the manifold assumption; on the other hand, it reduces the dimensionality of the data and therefore the time spent computing similarities in label propagation. Experimental results show that, when there are fewer labeled data than samples to be classified, the new method classifies better than traditional supervised text classification methods.

Key words: Latent Dirichlet Allocation (LDA) model, Label Propagation (LP) algorithm, semi-supervised learning, dimensionality reduction, manifold assumption
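The abstract's core idea, propagating labels over low-dimensional topic vectors rather than raw word vectors, can be illustrated with a minimal sketch. This is not the authors' LPLDA implementation. It assumes the document-topic distributions have already been produced by some LDA implementation, and it uses a Gaussian kernel on Euclidean distance as the similarity, which the abstract does not specify. The function name `propagate_labels`, the parameter `sigma`, and the toy vectors are all hypothetical.

```python
import math


def propagate_labels(topic_vecs, labels, n_classes, sigma=0.3, n_iter=200):
    """Iterative label propagation over document-topic vectors.

    topic_vecs: per-document topic distributions (assumed to come from LDA).
    labels: class index for labeled documents, None for unlabeled ones.
    Returns the predicted class index for every document.
    """
    n = len(topic_vecs)

    # Assumed similarity: Gaussian kernel on squared Euclidean distance.
    def similarity(a, b):
        dist_sq = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-dist_sq / (2 * sigma ** 2))

    weights = [[similarity(topic_vecs[i], topic_vecs[j]) for j in range(n)]
               for i in range(n)]

    # Row-normalise the similarities into transition probabilities.
    trans = []
    for row in weights:
        total = sum(row)
        trans.append([w / total for w in row])

    # One-hot rows for labeled documents, uniform rows for unlabeled ones.
    def initial_row(label):
        if label is None:
            return [1.0 / n_classes] * n_classes
        return [1.0 if c == label else 0.0 for c in range(n_classes)]

    scores = [initial_row(label) for label in labels]

    for _ in range(n_iter):
        # Each document takes the weighted average of its neighbours' scores.
        scores = [[sum(trans[i][j] * scores[j][c] for j in range(n))
                   for c in range(n_classes)]
                  for i in range(n)]
        # Clamp the labeled documents back to their known labels.
        for i, label in enumerate(labels):
            if label is not None:
                scores[i] = initial_row(label)

    return [max(range(n_classes), key=lambda c: row[c]) for row in scores]


# Hypothetical 3-topic vectors: two labeled documents, four unlabeled ones.
docs = [
    [0.90, 0.05, 0.05],  # labeled, class 0
    [0.05, 0.90, 0.05],  # labeled, class 1
    [0.80, 0.10, 0.10],
    [0.85, 0.10, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.85, 0.05],
]
labels = [0, 1, None, None, None, None]
predicted = propagate_labels(docs, labels, n_classes=2)
print(predicted)  # [0, 1, 0, 0, 1, 1]
```

Because each document is described by a handful of topic proportions instead of a vocabulary-sized vector, every pairwise similarity costs time proportional to the number of topics, which is the source of the speed-up the abstract claims.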


LIU Pei-qi, SUN Jie-han. Label propagation algorithm based on LDA model[J]. Journal of Computer Applications, 2012, 32(02): 403-410.


