To address the poor scalability of traditional hierarchical clustering algorithms on large-scale text, a parallel hierarchical clustering algorithm based on the MapReduce programming model was proposed. A vertical data partitioning algorithm, based on the statistical characteristics of component groups of the text vectors, was developed to partition the data for MapReduce. In addition, the sorting behavior of MapReduce was used to select the merge points, which makes the algorithm more efficient and helps improve clustering accuracy. The experimental results show that the proposed algorithm is effective and scales well.
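As a rough illustration of how a framework's sort step can select a merge point, the following single-process Python sketch imitates one map, sort, and reduce round. It is not the algorithm proposed here, whose details the abstract does not give. The cosine distance, the centroid-per-cluster representation, and all function names (`map_phase`, `shuffle_sort`, `reduce_phase`, `select_merge_point`) are assumptions made for the example, and vertical partitioning is not modeled.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def map_phase(centroids):
    """Emit (distance, (i, j)) with the distance as the key for each cluster pair."""
    for (i, u), (j, v) in combinations(sorted(centroids.items()), 2):
        yield cosine_distance(u, v), (i, j)


def shuffle_sort(records):
    """Stand-in for the sort-by-key that MapReduce performs between map and reduce."""
    return sorted(records)


def reduce_phase(sorted_records):
    """After sorting, the first record is the closest pair: the merge point."""
    return next(iter(sorted_records), None)


def select_merge_point(centroids):
    """Return (distance, (i, j)) for the closest pair, or None if fewer than two clusters."""
    return reduce_phase(shuffle_sort(map_phase(centroids)))


if __name__ == "__main__":
    # Hypothetical term-weight vectors for three clusters.
    clusters = {0: [1.0, 0.0, 0.0], 1: [0.9, 0.1, 0.0], 2: [0.0, 0.0, 1.0]}
    print(select_merge_point(clusters))
```

Because the distance is the record key, the reducer does not need to scan for the minimum: it reads the first record of the sorted stream. In a real Hadoop job the sort would be done by the framework rather than by `sorted`.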