%0 Journal Article
%A CHEN Xiaodong
%A CUI Yixin
%T Spark framework based optimized large-scale spectral clustering parallel algorithm
%D 2020
%R 10.11772/j.issn.1001-9081.2019061061
%J Journal of Computer Applications
%P 168-172
%V 40
%N 1
%X To address the performance bottlenecks of spectral clustering on large-scale datasets, such as time-consuming computation and the inability to complete clustering, a parallel spectral clustering algorithm suited to large-scale datasets was proposed based on Spark. Firstly, the similarity matrix was constructed through one-way loop iteration to avoid repeated computation. Then, the construction and normalization of the Laplacian matrix were optimized by position transformation and scalar multiplication replacement in order to reduce storage requirements. Finally, approximate eigenvector calculation was used to further reduce the computational cost. The experimental results on different test datasets show that, as the dataset size increases, the running times of the one-way loop iteration and the approximate eigenvector calculation grow slowly and linearly, the clustering results of approximate eigenvector calculation are similar to those of exact eigenvector calculation, and the algorithm shows good scalability on large-scale datasets. While obtaining better spectral clustering performance, the improved algorithm increases computational efficiency and effectively alleviates the high computational cost and the inability to complete clustering.
%U http://www.joca.cn/EN/10.11772/j.issn.1001-9081.2019061061