计算机应用 ›› 2016, Vol. 36 ›› Issue (10): 2710-2714.DOI: 10.11772/j.issn.1001-9081.2016.10.2710

• 先进计算 • 上一篇    下一篇

基于GraphX的分布式幂迭代聚类

赵军1,2, 徐晓燕2   

  1. 1. 北京航空航天大学 计算机科学与技术系, 北京 100191;
    2. 国家计算机网络应急技术处理协调中心, 北京 100029
  • 收稿日期:2016-04-29 修回日期:2016-06-22 出版日期:2016-10-10 发布日期:2016-10-10
  • 通讯作者: 徐晓燕,E-mail:xuxy@cert.org.cn
  • 作者简介:赵军(1990—),男,四川绵阳人,硕士研究生,主要研究方向:大数据分析与处理、机器学习;徐晓燕(1987—),女,山东临沂人,工程师,博士,主要研究方向:网络安全。
  • 基金资助:
    国家973计划项目(2014CB340305)。

Distributed power iteration clustering based on GraphX

ZHAO Jun1,2, XU Xiaoyan2   

  1. 1. Department of Computer Science and Technology, Beihang University, Beijing 100191, China;
    2. National Computer Network Emergency Response Technical Team /Coordinate Center of China, Beijing 100029, China
  • Received:2016-04-29 Revised:2016-06-22 Online:2016-10-10 Published:2016-10-10
  • Supported by:
    BackgroundThis work is partially supported by the National Basic Research Program (973 Program) of China (2014CB340305).

摘要: 为解决幂迭代聚类算法并行实现中存在的编程繁琐、效率低下等问题,基于Spark大规模数据通用计算引擎及其GraphX组件,提出了一种在分布式环境下实现幂迭代聚类的方法。首先,利用某种相似性度量方法,将原始数据转换成一个可以视为图的亲和矩阵;然后,通过顶点切割,把行归一化后的亲和矩阵切分成若干个小图,分别存储在不同的机器上;最后,利用Spark基于内存计算的特点,对存储在集群中的图进行多次迭代计算,得到这个图的一个切割,图的每一个划分子图对应一个类簇。在不同规模的数据集和不同executor个数下进行的实验结果表明,基于GraphX的分布式幂迭代聚类算法具有良好的可扩展性,算法运行时间与executor个数呈负相关的线性关系,在6个executor下,与单个executor相比,算法的加速比达到了2.09到3.77。同时,通过与基于Hadoop的幂迭代聚类进行对比,在新闻数量为40000篇时,运行时间降低了61%。

关键词: GraphX, 图计算, 幂迭代聚类, 内存计算, RDD

Abstract: Concerning the cumbersome programming and low efficiency in parallel power iteration clustering algorithm, a new method for power iteration clustering in distributed environment was put forward based on Spark, a general computational engine for large-scale data processing, and its component GraphX. Firstly, the raw data was transformed into an affinity matrix which can be viewed as a graph by using some kind of similarity measure ment method. Secondly, by using vertex-cut technology, the row-normalized affinity matrix was divided into a number of subgraphs, which were stored on different machines of a cluster. Finally, using the in-memory computational framework Spark, several iterations were performed on the subgraphs stored in the cluster to get a cut of the original graph, and each subgraph of the original graph corresponded to a cluster. The experiments were carried out on datasets with different sizes and different number of executors. Experimental results show that the proposed distributed power iteration clustering algorithm has a good scalability, its running time is negatively correlated with the number of executors, the speedup of the algorithm ranges between 2.09 to 3.77 in a cluster of 6 executors compared with a single executor. Meanwhile, compared with the Hadoop-based power iteration clustering version, the running time of the proposed algorithm decreased significantly by 61% when dealing with 40000 pieces of news.

Key words: GraphX, graph computation, power iteration clustering, in-memory computation, Resilient Distributed Dataset (RDD)

中图分类号: