Journal of Computer Applications, 2014, Vol. 34, Issue (1): 249-254. DOI: 10.11772/j.issn.1001-9081.2014.01.0249

• Network and distributed techno

Improved K-means algorithm based on latent Dirichlet allocation for text clustering

WANG Chunlong¹, ZHANG Jingxu²

  1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China;
  2. Gansu Electric Power Corporation, Lanzhou, Gansu 730030, China
  • Received:2013-07-23 Revised:2013-09-27 Online:2014-01-01 Published:2014-02-14
  • Contact: WANG Chunlong



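  • About the authors: WANG Chunlong (born 1987), male, from Baoding, Hebei, master's student; main research interests: information retrieval and the Semantic Web. ZHANG Jingxu (born 1983), male, from Laiwu, Shandong, master's student; main research interest: information systems.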


Abstract: Because the traditional K-means algorithm selects its initial cluster centers at random, it may need more iterations, often converges to a local optimum, and can produce unstable clustering results. To address these problems, an algorithm based on the Latent Dirichlet Allocation (LDA) topic model was proposed for selecting the initial cluster centers of K-means. The algorithm first selects the top-m most influential topics in the text corpus and then preliminarily clusters the corpus in the m dimensions corresponding to those topics. This yields m cluster centers, which are then used as the initial centers for clustering the corpus over all dimensions. In theory, the initial cluster centers are therefore determined on a probabilistic basis rather than chosen at random. The experimental results show that the improved algorithm needs noticeably fewer iterations and produces more accurate clusters.
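The seeding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not confirm: `theta` is a document-topic matrix already inferred by some LDA implementation, a topic's "importance" is taken to be its total weight over the corpus, the preliminary clustering simply groups each document by its dominant topic among the top m, and "all dimensions" means the full topic vector. The function names (`top_m_topics`, `seed_centers`, `kmeans`) are hypothetical.

```python
from math import dist


def mean_vector(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def top_m_topics(theta, m):
    """Indices of the m topics with the largest total weight (assumed importance measure)."""
    n_topics = len(theta[0])
    totals = [sum(row[k] for row in theta) for k in range(n_topics)]
    return sorted(range(n_topics), key=lambda k: totals[k], reverse=True)[:m]


def seed_centers(theta, m):
    """Preliminary clustering on the top-m topic dimensions only.

    Assumption: each document joins the group of its dominant topic among
    the top m; each non-empty group's full-dimensional mean is a seed center.
    """
    topics = top_m_topics(theta, m)
    groups = {k: [] for k in topics}
    for row in theta:
        groups[max(topics, key=lambda k: row[k])].append(row)
    return [mean_vector(g) for g in groups.values() if g]


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations from the given centers.

    Returns (labels, centers, number of iterations performed).
    """
    for iteration in range(1, max_iter + 1):
        labels = [
            min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            for p in points
        ]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean_vector(members) if members else centers[c])
        if new_centers == centers:
            return labels, centers, iteration
        centers = new_centers
    return labels, centers, max_iter


# Toy document-topic matrix (made-up values): 6 documents, 4 topics.
theta = [
    [0.70, 0.10, 0.10, 0.10],
    [0.80, 0.05, 0.10, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.05, 0.80, 0.05, 0.10],
    [0.65, 0.15, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
]
labels, centers, n_iter = kmeans(theta, seed_centers(theta, m=2))
print(labels, n_iter)
```

Because the seeds are computed from the data rather than drawn at random, repeated runs of this sketch on the same `theta` give the same result, which is the stability property the abstract emphasizes.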

Key words: topic model, K-means, cluster center, text clustering, Latent Dirichlet Allocation (LDA)


CLC Number: