基于GraphX的分布式幂迭代聚类

doi:10.11772/j.issn.1001-9081.2016.10.2710

计算机应用 ›› 2016, Vol. 36 ›› Issue (10): 2710-2714.DOI: 10.11772/j.issn.1001-9081.2016.10.2710

基于GraphX的分布式幂迭代聚类

赵军^1,2, 徐晓燕²

1. 北京航空航天大学计算机科学与技术系, 北京 100191;
2. 国家计算机网络应急技术处理协调中心, 北京 100029

收稿日期:2016-04-29 修回日期:2016-06-22 出版日期:2016-10-10 发布日期:2016-10-10
通讯作者: 徐晓燕,E-mail:xuxy@cert.org.cn
作者简介:赵军(1990—),男,四川绵阳人,硕士研究生,主要研究方向:大数据分析与处理、机器学习;徐晓燕(1987—),女,山东临沂人,工程师,博士,主要研究方向:网络安全。
基金资助:
国家973计划项目（2014CB340305）。

Distributed power iteration clustering based on GraphX

ZHAO Jun^1,2, XU Xiaoyan²

1. Department of Computer Science and Technology, Beihang University, Beijing 100191, China;
2. National Computer Network Emergency Response Technical Team /Coordinate Center of China, Beijing 100029, China

Received:2016-04-29 Revised:2016-06-22 Online:2016-10-10 Published:2016-10-10
Supported by:
BackgroundThis work is partially supported by the National Basic Research Program (973 Program) of China (2014CB340305).

摘要/Abstract

摘要： 为解决幂迭代聚类算法并行实现中存在的编程繁琐、效率低下等问题，基于Spark大规模数据通用计算引擎及其GraphX组件，提出了一种在分布式环境下实现幂迭代聚类的方法。首先，利用某种相似性度量方法，将原始数据转换成一个可以视为图的亲和矩阵；然后，通过顶点切割，把行归一化后的亲和矩阵切分成若干个小图，分别存储在不同的机器上；最后，利用Spark基于内存计算的特点，对存储在集群中的图进行多次迭代计算，得到这个图的一个切割，图的每一个划分子图对应一个类簇。在不同规模的数据集和不同executor个数下进行的实验结果表明，基于GraphX的分布式幂迭代聚类算法具有良好的可扩展性，算法运行时间与executor个数呈负相关的线性关系，在6个executor下，与单个executor相比，算法的加速比达到了2.09到3.77。同时，通过与基于Hadoop的幂迭代聚类进行对比，在新闻数量为40000篇时，运行时间降低了61%。

关键词: GraphX, 图计算, 幂迭代聚类, 内存计算, RDD

Abstract: Concerning the cumbersome programming and low efficiency in parallel power iteration clustering algorithm, a new method for power iteration clustering in distributed environment was put forward based on Spark, a general computational engine for large-scale data processing, and its component GraphX. Firstly, the raw data was transformed into an affinity matrix which can be viewed as a graph by using some kind of similarity measure ment method. Secondly, by using vertex-cut technology, the row-normalized affinity matrix was divided into a number of subgraphs, which were stored on different machines of a cluster. Finally, using the in-memory computational framework Spark, several iterations were performed on the subgraphs stored in the cluster to get a cut of the original graph, and each subgraph of the original graph corresponded to a cluster. The experiments were carried out on datasets with different sizes and different number of executors. Experimental results show that the proposed distributed power iteration clustering algorithm has a good scalability, its running time is negatively correlated with the number of executors, the speedup of the algorithm ranges between 2.09 to 3.77 in a cluster of 6 executors compared with a single executor. Meanwhile, compared with the Hadoop-based power iteration clustering version, the running time of the proposed algorithm decreased significantly by 61% when dealing with 40000 pieces of news.

Key words: GraphX, graph computation, power iteration clustering, in-memory computation, Resilient Distributed Dataset (RDD)

中图分类号:

TP301.6

赵军, 徐晓燕. 基于GraphX的分布式幂迭代聚类[J]. 计算机应用, 2016, 36(10): 2710-2714.

ZHAO Jun, XU Xiaoyan. Distributed power iteration clustering based on GraphX[J]. Journal of Computer Applications, 2016, 36(10): 2710-2714.

参考文献

[1] 王虹旭, 吴斌, 刘旸. 基于Spark的并行图数据分析系统[J]. 计算机科学与探索, 2015, 9(9):1066-1074.(WANG H X, WU B, LIU Y. Parallel graph data analysis system based on Spark[J]. Journal of Frontiers of Computer Science and Technology, 2015, 9(9): 1066-1074.)
[2] YAN W, BRAHMAKSHATRIYA U, XUE Y, et al. p-PIC: parallel power iteration clustering for big data[J]. Journal of Parallel & Distributed Computing, 2013, 73(3):352-359.
[3] LIU L, CHEN X, LIU M, et al. An influence power-based clustering approach with PageRank-like model[J]. Applied Soft Computing, 2015, 40(11):17-32.
[4] LIN N P, CHANG C I, CHUEH H E, et al. A deflected grid-based algorithm for clustering analysis[J]. Wseas Transactions on Computers, 2008, 7(4):125-132.
[5] MENENDEZ H D, CAMACHO D. GANY: a genetic spectral-based clustering algorithm for large data analysis[C]//Proceedings of the 2015 IEEE Congress on Evolutionary Computation. Piscataway, NJ: IEEE, 2015:640-647.
[6] LIU L, SUN L, CHEN S, et al. K-PRSCAN: a clustering method based on PageRank[J]. Neurocomputing, 2016, 175(11):65-80.
[7] LIN F, COHEN W W. Power iteration clustering[C]//Proceedings of the 27th International Conference on Machine Learning. Haifa: [s.n.], 2010: 655-662.
[8] DARJI A, WAGHELA D. Parallel power iteration clustering for big data using MapReduce in Hadoop[J]. International Journal of Advanced Research in Computer Science and Software Engineering, 2014, 4(6):1357-1363.
[9] ZAHARIA M, CHOWDHURY M, FRANKLIN M J, et al. Spark: cluster computing with working sets[C]//HotCloud 2010: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. Berkeley: USENIX, 2010:1765-1773.
[10] GONZALEZ J E, XIN R S, DAVE A, et al. GraphX: graph processing in a distributed dataflow framework[C]//OSDI 2014: Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation. Berkeley: USENIX, 2014:599-613.
[11] ZAHARIA M, CHOWDHURY M, DAS T, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing[C]//NSDI 2012: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. Berkeley: USENIX, 2012:2.
[12] 陈侨安, 李峰, 曹越, 等. 基于运行数据分析的Spark任务参数优化[J]. 计算机工程与科学, 2016, 38(1):11-19.(CHEN Q A, LI F, CAO Y, et al. Parameter optimization for Spark jobs based on runtime data analysis[J]. Computer Engineering and Science, 2016, 38(1):11-19.)

基于GraphX的分布式幂迭代聚类

Distributed power iteration clustering based on GraphX

PDF

可视化

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 5

编辑推荐

Metrics

[1]	张鲁飞, 孙茹君, 秦芳. 面向图计算的运行时系统的消息聚合技术[J]. 计算机应用, 2021, 41(4): 984-989.
[2]	卞琛, 于炯, 修位蓉, 英昌甜, 钱育蓉. 基于迭代填充的内存计算框架分区映射算法[J]. 计算机应用, 2017, 37(3): 647-653.
[3]	丁梦苏, 陈世敏. 轻量级大数据运算系统Helius[J]. 计算机应用, 2017, 37(2): 305-310.
[4]	赵永彬, 陈硕, 刘明, 王佳楠, 贲驰. 流计算与内存计算架构下的运营状态监测分析[J]. 计算机应用, 2017, 37(10): 3029-3033.
[5]	王桂兰, 周国亮, 萨初日拉, 朱永利. Spark环境下的并行模糊C均值聚类算法[J]. 计算机应用, 2016, 36(2): 342-347.