Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (6): 1595-1599. DOI: 10.11772/j.issn.1001-9081.2014.06.1595

• Advanced computing •

Parallel text hierarchical clustering based on MapReduce

YU Xiaoshan, WU Yangyang

  1. School of Computer Science and Technology, Huaqiao University, Xiamen Fujian 361021, China
  • Received: 2013-11-12 Revised: 2013-12-30 Online: 2014-06-01 Published: 2014-07-02
  • Contact: WU Yangyang
  • About the authors: YU Xiaoshan (born 1989), male, from Quanzhou, Fujian, master's student, CCF member; main research interest: text mining. WU Yangyang (born 1957), female, from Quanzhou, Fujian, professor, CCF member; main research interests: databases and data mining.
  • Supported by:

    Major Project of the Science and Technology Program of Fujian Province; Key Project of the Science and Technology Program of Fujian Province

Abstract:

To address the limited scalability of traditional hierarchical clustering algorithms on large-scale text, a parallel text hierarchical clustering algorithm based on the MapReduce programming model is proposed. A vertical data partitioning algorithm, based on statistics of the component groups of text vectors, is applied to data distribution in MapReduce, and the sorting property of MapReduce is applied to the selection of merge points. Together these make the algorithm more efficient and help improve clustering accuracy. Experimental results show that the algorithm is effective for large-scale text clustering and scales well.
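The merge-point idea can be illustrated with a small sketch. This is not the authors' implementation: it is a single-process Python stand-in for the map, shuffle-sort, and reduce phases, and the function names (`map_pairs`, `shuffle_sort`, `reduce_first`, `merge`, `cluster`), the use of cosine distance, and the summed-weight merge rule are all assumptions made for illustration. The vertical data partitioning step is not modeled here.

```python
from itertools import combinations
import math

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def map_pairs(clusters):
    """Map step (assumed): emit (distance, (id_a, id_b)) for each cluster pair."""
    for (a, va), (b, vb) in combinations(sorted(clusters.items()), 2):
        yield (1.0 - cosine(va, vb), (a, b))

def shuffle_sort(records):
    """Stand-in for the framework's sort-by-key between map and reduce."""
    return sorted(records)

def reduce_first(sorted_records):
    """Reduce step (assumed): the first sorted record is the closest pair."""
    return next(iter(sorted_records))[1]

def merge(clusters, a, b):
    """Merge cluster b into a by summing term weights (illustrative rule)."""
    merged = dict(clusters[a])
    for term, weight in clusters[b].items():
        merged[term] = merged.get(term, 0.0) + weight
    out = {k: v for k, v in clusters.items() if k not in (a, b)}
    out[a] = merged
    return out

def cluster(docs, k):
    """Agglomerate until k clusters remain; return clusters and merge history."""
    clusters = dict(docs)
    history = []
    while len(clusters) > k:
        a, b = reduce_first(shuffle_sort(map_pairs(clusters)))
        history.append((a, b))
        clusters = merge(clusters, a, b)
    return clusters, history
```

Because the distance is the sort key, the pair to merge is simply the first record the reducer sees, so no separate minimum search is needed. In a real MapReduce job the sort would be performed by the framework's shuffle phase rather than by `sorted`.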

CLC number: