Improved K-means algorithm based on latent Dirichlet allocation for text clustering
WANG Chunlong1, ZHANG Jingxu2
1. School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China;
2. Gansu Electric Power Corporation, Lanzhou Gansu 730030, China
Abstract: Because its initial cluster centers are selected at random, the traditional K-means algorithm tends to require a large number of iterations, often converges to a local optimum, and produces unstable clusterings. To address these problems, an algorithm based on the Latent Dirichlet Allocation (LDA) model is proposed for selecting the initial cluster centers of K-means. The improved algorithm first selects the top-m most important topics in the text corpus. The corpus is then preliminarily clustered on these m topic dimensions, which yields m cluster centers. These centers serve as the initial centers for clustering over all dimensions of the corpus. In theory, the center of each cluster can thus be determined from probabilities rather than chosen at random. Experiments show that the improved algorithm produces more accurate clustering results in fewer iterations.
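The seeding idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not confirm: the document-topic matrix `doc_topic` is taken as already produced by some LDA model, topic "importance" is taken to be total probability mass over the corpus, and the preliminary clustering assigns each document to its strongest topic among the selected m. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

def top_m_topics(doc_topic: Sequence[Sequence[float]], m: int) -> List[int]:
    """Rank topics by total probability mass (assumed importance measure)."""
    n_topics = len(doc_topic[0])
    mass = [sum(row[t] for row in doc_topic) for t in range(n_topics)]
    return sorted(range(n_topics), key=lambda t: mass[t], reverse=True)[:m]

def mean(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def lda_seeded_centers(doc_topic, features, m: int) -> List[List[float]]:
    """Preliminary clustering on the m selected topic dimensions.

    Assumption: each document joins the selected topic on which it has the
    highest probability; the center is the mean of the members' full
    feature vectors.
    """
    topics = top_m_topics(doc_topic, m)
    groups = {t: [] for t in topics}
    for row, feat in zip(doc_topic, features):
        best = max(topics, key=lambda t: row[t])
        groups[best].append(feat)
    # Topics that attracted no documents yield no center.
    return [mean(g) for g in groups.values() if g]

def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centers, max_iter: int = 100) -> Tuple[List[int], List[List[float]], int]:
    """Plain K-means over all dimensions, started from the given centers."""
    centers = [list(c) for c in centers]
    labels: List[int] = []
    for iteration in range(1, max_iter + 1):
        new_labels = [
            min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            return labels, centers, iteration
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster empties
                centers[k] = mean(members)
    return labels, centers, max_iter

# Toy data: rows are documents, columns are topic probabilities.
doc_topic = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
]
centers = lda_seeded_centers(doc_topic, doc_topic, m=2)
labels, centers, n_iter = kmeans(doc_topic, centers)
```

In this toy example the full topic vector doubles as the feature vector; with real text, `features` would more plausibly be the complete document representation, while `doc_topic` is used only to choose the starting centers.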
WANG Chunlong, ZHANG Jingxu. Improved K-means algorithm based on latent Dirichlet allocation for text clustering. Journal of Computer Applications, 2014, 34(1): 249-254.