Distributed power iteration clustering based on GraphX

ZHAO Jun, XU Xiaoyan

Journal of Computer Applications 2016, 36 (10): 2710-2714. DOI: 10.11772/j.issn.1001-9081.2016.10.2710

To address the cumbersome programming and low efficiency of parallel power iteration clustering algorithms, a new method for power iteration clustering in a distributed environment was proposed based on Spark, a general computational engine for large-scale data processing, and its component GraphX. Firstly, the raw data was transformed, using a similarity measure, into an affinity matrix that can be viewed as a graph. Secondly, using vertex-cut partitioning, the row-normalized affinity matrix was divided into a number of subgraphs, which were stored on different machines of a cluster. Finally, using the in-memory computational framework Spark, several iterations were performed on the subgraphs stored in the cluster to obtain a cut of the original graph, with each resulting subgraph of the original graph corresponding to a cluster. The experiments were carried out on datasets of different sizes and with different numbers of executors. Experimental results show that the proposed distributed power iteration clustering algorithm has good scalability and that its running time is negatively correlated with the number of executors. The speedup of the algorithm ranges from 2.09 to 3.77 in a cluster of 6 executors compared with a single executor. Meanwhile, compared with the Hadoop-based version of power iteration clustering, the running time of the proposed algorithm decreased significantly, by 61%, when processing 40000 news items.
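The three steps in the abstract (build an affinity matrix, row-normalize it, run a few power iterations and read clusters off the result) can be illustrated with a minimal single-machine sketch in plain Python. This is not the authors' Spark/GraphX implementation and does no graph partitioning or distribution. The Gaussian similarity, the `sigma` parameter, the degree-based starting vector, the acceleration-based stopping rule, and all function names are assumptions made for illustration; the abstract does not specify them. The final step splits the one-dimensional embedding at its largest gaps, a simple stand-in for the k-means step commonly used with power iteration clustering.

```python
import math


def affinity_matrix(points, sigma=1.0):
    """Pairwise Gaussian similarity (an assumed measure); the diagonal is 0."""
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                A[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    return A


def split_by_gaps(values, k):
    """Label values 0..k-1 by cutting the sorted values at the k-1 largest gaps."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    positions = sorted(
        range(1, len(order)),
        key=lambda p: values[order[p]] - values[order[p - 1]],
        reverse=True,
    )[: k - 1]
    cuts = set(positions)
    labels = [0] * len(values)
    current = 0
    for pos, idx in enumerate(order):
        if pos in cuts:
            current += 1
        labels[idx] = current
    return labels


def power_iteration_clustering(points, k, sigma=1.0, max_iter=30, tol=1e-6):
    """Cluster points (at least two, none isolated) via truncated power iteration."""
    A = affinity_matrix(points, sigma)
    degrees = [sum(row) for row in A]
    # Row-normalize the affinity matrix: W = D^-1 A.
    W = [[x / d for x in row] for row, d in zip(A, degrees)]
    # Assumed starting vector: degrees scaled to sum to 1.
    total = sum(degrees)
    v = [d / total for d in degrees]
    prev_delta = float("inf")
    for _ in range(max_iter):
        new_v = [sum(w * x for w, x in zip(row, v)) for row in W]
        norm = sum(abs(x) for x in new_v)
        new_v = [x / norm for x in new_v]
        delta = max(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        # Stop early once the change per step levels off; running to full
        # convergence would make every entry of v equal and erase the clusters.
        if abs(prev_delta - delta) < tol:
            break
        prev_delta = delta
    return split_by_gaps(v, k)
```

Calling `power_iteration_clustering(points, 2)` on two well-separated groups of 2-D points returns one integer label per point, with points of the same group sharing a label. In a distributed setting, the matrix-vector product inside the loop is the part that would be spread across graph partitions.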
