Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (1): 249-254. DOI: 10.11772/j.issn.1001-9081.2014.01.0249

• Artificial Intelligence •

Improved K-means algorithm based on latent Dirichlet allocation for text clustering

WANG Chunlong1, ZHANG Jingxu2

  1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China;
    2. Gansu Electric Power Corporation, Lanzhou Gansu 730030, China
  • Received: 2013-07-23  Revised: 2013-09-27  Online: 2014-01-01  Published: 2014-02-14
  • Contact: WANG Chunlong
  • About the authors: WANG Chunlong (1987-), male, born in Baoding, Hebei, is an M.S. candidate; his research interests include information retrieval and the semantic Web. ZHANG Jingxu (1983-), male, born in Laiwu, Shandong, is an M.S. candidate; his research interests include information systems.
  • Supported by:

    National Natural Science Foundation of China; Science and Technology Project of State Grid Corporation of China

Abstract: Because the initial cluster centers of the traditional K-means algorithm are selected at random, the algorithm may need more iterations, fall into a local optimum, and produce unstable clustering results. To address these problems, an initial cluster-center selection algorithm based on the Latent Dirichlet Allocation (LDA) topic model was proposed for K-means. The improved algorithm first selects the top-m most influential topics in the text corpus and preliminarily clusters the corpus on these m topic dimensions to obtain cluster centers; these centers are then used as the initial centers for clustering the corpus on all dimensions. In theory, this guarantees that the initial cluster centers are determined by probability rather than chosen at random. The experimental results show that the improved algorithm needs noticeably fewer iterations and produces more accurate clustering results.
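
The following is a minimal sketch of the initialization strategy described in the abstract, assuming gensim's LdaModel and scikit-learn's KMeans; the function name lda_kmeans, the topic counts, and the way preliminary centers are lifted to the full topic space are illustrative assumptions, not the authors' implementation.

    # A sketch, not the paper's code: LDA-based selection of initial K-means centers.
    import numpy as np
    from gensim import corpora, models   # LDA topic model (assumed library)
    from sklearn.cluster import KMeans   # K-means clustering (assumed library)

    def lda_kmeans(tokenized_docs, num_topics=20, m=5):
        # Fit an LDA model and build the document-topic matrix theta (n_docs x num_topics).
        dictionary = corpora.Dictionary(tokenized_docs)
        bow = [dictionary.doc2bow(doc) for doc in tokenized_docs]
        lda = models.LdaModel(bow, id2word=dictionary, num_topics=num_topics, passes=10)
        theta = np.array([[p for _, p in lda.get_document_topics(d, minimum_probability=0.0)]
                          for d in bow])

        # Select the top-m topics with the largest overall weight in the corpus.
        top_m = np.argsort(theta.sum(axis=0))[::-1][:m]

        # Preliminary clustering on the m selected topic dimensions only.
        pre = KMeans(n_clusters=m, n_init=10).fit(theta[:, top_m])

        # Lift each preliminary center to all topic dimensions by averaging the
        # full topic vectors of the documents assigned to that preliminary cluster.
        init_centers = np.vstack([theta[pre.labels_ == k].mean(axis=0) for k in range(m)])

        # Final K-means on all dimensions, seeded with the derived (non-random) centers.
        final = KMeans(n_clusters=m, init=init_centers, n_init=1).fit(theta)
        return final.labels_

Seeding the final pass with a fixed array of centers (init=init_centers) removes the random initialization, which is where the reported reduction in iterations and the more stable results would come from.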

Key words: topic model, K-means, cluster center, text clustering, Latent Dirichlet Allocation (LDA)

CLC Number: