Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (1): 168-172.DOI: 10.11772/j.issn.1001-9081.2019061061

• Advanced computing • Previous Articles     Next Articles

Spark framework based optimized large-scale spectral clustering parallel algorithm

CUI Yixin, CHEN Xiaodong   

  1. Network and Information Center, Taiyuan Institute of Technology, Taiyuan Shanxi 030008, China
  • Received:2019-06-21 Revised:2019-09-22 Online:2020-01-10 Published:2019-10-10
  • Contact: 崔艺馨


崔艺馨, 陈晓东   

  1. 太原工业学院 网络与信息中心, 太原 030008
  • 作者简介:崔艺馨(1981-),女,山西忻州人,实验师,硕士,主要研究方向:数据挖掘、计算机网络;陈晓东(1978-),男,河北唐山人,副教授,博士,主要研究方向:计算机网络。

Abstract: To solve the performance bottlenecks such as time-consuming computation and inability of clustering in spectral clustering on large-scale datasets, a spectral clustering parallelization algorithm suitable for large-scale datasets was proposed based on Spark technology. Firstly, the similar matrices were constructed through one-way loop iteration to avoid double counting. Then, the construction and normalization of Laplacian matrices were optimized by position transformation and scalar multiplication replacement in order to reduce the storage requirements. Finally, the approximate eigenvector calculation was used to further reduce the computational cost. The experimental results on different test datasets show that, as the size of test dataset increases, the proposed algorithm has the running time of one-way loop iteration and the approximate eigenvector calculation increased linearly with slow speed, the clustering effects of approximate eigenvector calculation are similar to those of exact eigenvector calculation, and the algorithm shows good extensibility on large-scale datasets. On the basis of obtaining better spectral clustering performance, the improved algorithm increases operation efficiency, and effectively alleviates high computational cost and the problem of clustering.

Key words: large-scale spectral clustering, similar matrix sparsification, one-way loop iteration, approximate eigenvector, distributed Spark parallel computing

摘要: 为解决谱聚类在大规模数据集上存在的计算耗时和无法聚类等性能瓶颈制约,提出了基于Spark技术的大规模数据集谱聚类的并行化算法。首先,通过单向循环迭代优化相似矩阵的构建,避免重复计算;然后,通过位置变换和标量乘法替换来优化Laplacian矩阵的构建与正规化,降低存储需求;最后,采用近似特征向量计算来进一步减少计算量。不同测试数据集上的实验结果表明:随着测试数据集的规模增加,所提算法的单向循环迭代和近似特征值计算的运行时间呈线性增长,增长缓慢,其近似特征向量计算与精确特征向量计算取得相近的聚类效果,并且算法在大规模数据集上表现出良好的可扩展性。在获得较好的谱聚类性能的基础上,改进算法提高了运行效率,有效缓解了谱聚类的计算耗时及无法聚类问题。

关键词: 大规模谱聚类, 相似矩阵稀疏化, 单向循环迭代, 近似特征向量, 分布式Spark并行计算

CLC Number: