Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (04): 1023-1025. DOI: 10.3724/SP.J.1087.2013.01023

• Advanced computing •

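Parallel K-Medoids algorithm based on MapReduce

ZHANG Xueping, GONG Kangli, ZHAO Guangcai

  1. College of Information Science and Engineering, Henan University of Technology, Zhengzhou Henan 450001, China
  • Received: 2012-10-12 Revised: 2012-11-27 Online: 2013-04-01 Published: 2013-04-23
  • Contact: GONG Kangli
  • About the authors: ZHANG Xueping (born 1968), female, from Zhengzhou, Henan, professor; research interests: intelligent information processing and spatial data mining. GONG Kangli (born 1987), female, from Taizhou, Jiangsu, master's student; research interest: intelligent information processing. ZHAO Guangcai (born 1983), male, from Luohe, Henan, master's student; research interest: intelligent information processing.
  • Supported by:

    Program for New Century Excellent Talents in University of the Ministry of Education (NCET-08-0660); Program for Science and Technology Innovation Talents in Universities of Henan Province (2008HASTIT012); Natural Science Foundation of Hainan Province (610221); Graduate Innovation Program Fund of Henan University of Technology (11YJCX69)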

Abstract: The traditional K-Medoids clustering algorithm runs into memory-capacity and CPU-speed bottlenecks when it processes massive data sets. To address this, and building on an in-depth study of K-Medoids, a parallel K-Medoids algorithm based on the MapReduce programming model is proposed. The Map function computes the distance from each data object to the cluster center points (medoids) and (re)assigns the object to a cluster. The Reduce function uses the intermediate results of the Map phase to compute the new center point of each cluster; these center points then serve as the center-point set for the next MapReduce iteration. Experimental results show that the parallel algorithm, running on a Hadoop cluster, achieves good clustering results and scalability, and that its speedup is closer to linear for larger data sets.
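
The Map and Reduce roles described in the abstract can be illustrated with a minimal single-process Python sketch. It is not the authors' Hadoop implementation. The function names (`kmedoids_map`, `kmedoids_reduce`, `run_kmedoids`), the Euclidean metric, the medoid-update rule (the cluster member with the smallest total distance to the other members), and the stopping test are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import defaultdict


def distance(a, b):
    """Euclidean distance (an assumption; the abstract names no metric)."""
    return math.dist(a, b)  # requires Python 3.8+


def kmedoids_map(point, medoids):
    """Map: emit (index of the nearest medoid, point)."""
    nearest = min(range(len(medoids)), key=lambda i: distance(point, medoids[i]))
    return nearest, point


def kmedoids_reduce(members):
    """Reduce: pick the member with the smallest total distance to the others."""
    return min(members, key=lambda c: sum(distance(c, p) for p in members))


def run_kmedoids(points, medoids, max_rounds=20):
    """Repeat map + reduce rounds until the medoid set stops changing."""
    for _ in range(max_rounds):
        clusters = defaultdict(list)  # stands in for the shuffle phase
        for point in points:
            key, value = kmedoids_map(point, medoids)
            clusters[key].append(value)
        # Keep the old medoid if a cluster happens to receive no points.
        new_medoids = [
            kmedoids_reduce(clusters[i]) if clusters[i] else medoids[i]
            for i in range(len(medoids))
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids  # medoid set handed to the next round
    return medoids


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
final_medoids = run_kmedoids(points, medoids=[(1.0, 2.0), (9.0, 8.0)])
print(final_medoids)  # [(1.0, 1.0), (8.0, 8.0)]
```

In a real Hadoop job the per-point map calls would run in parallel across input splits and the framework would group the emitted pairs by cluster index before the reduce step; the `defaultdict` here only imitates that grouping in memory.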

Key words: K-Medoids, cloud computing, MapReduce, parallel computing, Hadoop
