Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (1): 168-172. DOI: 10.11772/j.issn.1001-9081.2019061061

• Advanced Computing •

Spark framework based optimized large-scale spectral clustering parallel algorithm

CUI Yixin, CHEN Xiaodong

  1. Network and Information Center, Taiyuan Institute of Technology, Taiyuan Shanxi 030008, China
  • Received: 2019-06-21  Revised: 2019-09-22  Online: 2020-01-10  Published: 2019-10-10
  • About the authors: CUI Yixin, born in 1981 in Xinzhou, Shanxi, is an experimentalist with a master's degree; her research interests include data mining and computer networks. CHEN Xiaodong, born in 1978 in Tangshan, Hebei, is an associate professor with a Ph.D.; his research interests include computer networks.
  • Contact: CUI Yixin

Abstract: To address the performance bottlenecks of spectral clustering on large-scale datasets, namely prohibitive computation time and failure to complete clustering, a parallel spectral clustering algorithm for large-scale datasets was proposed based on Spark. Firstly, the construction of the similarity matrix was optimized through one-way loop iteration so that each pairwise similarity is computed only once. Then, the construction and normalization of the Laplacian matrix were optimized by position transformation and scalar multiplication replacement, reducing the storage requirement. Finally, approximate eigenvector computation was adopted to further reduce the amount of computation. Experimental results on different test datasets show that, as the dataset size increases, the running time of the one-way loop iteration and of the approximate eigenvector computation grows linearly and slowly, the approximate eigenvector computation achieves clustering results comparable to those of exact eigenvector computation, and the algorithm scales well on large-scale datasets. While preserving good spectral clustering quality, the improved algorithm raises running efficiency and effectively alleviates the high computation cost and the failure-to-cluster problem of spectral clustering.
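The three optimizations outlined above can be illustrated with a compact single-machine sketch. The code below uses NumPy, SciPy and scikit-learn in place of Spark RDD operations; the Gaussian k-nearest-neighbour similarity, the use of SciPy's Lanczos solver (eigsh) as the approximate eigenvector step, and all function names and parameters are assumptions made for illustration, not the authors' implementation.

# Single-machine sketch of the pipeline described in the abstract; NumPy/SciPy/
# scikit-learn stand in for Spark RDD operations. Names and the Lanczos-based
# approximation are illustrative assumptions, not the authors' implementation.
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans


def knn_similarity(X, k=10, sigma=1.0):
    """Gaussian k-NN similarity matrix built by one-way loop iteration:
    every unordered pair (i, j) with i < j is evaluated once and mirrored,
    so no pairwise distance is computed twice."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n - 1):
        diff = X[i + 1:] - X[i]                          # only points after i
        w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * sigma ** 2))
        W[i, i + 1:] = w                                 # fill upper triangle
        W[i + 1:, i] = w                                 # mirror, no recomputation
    for i in range(n):                                   # sparsification:
        W[i, np.argsort(W[i])[:-k]] = 0.0                # keep k largest per row
    return np.maximum(W, W.T)                            # restore symmetry


def normalized_laplacian(W):
    """L = I - D^{-1/2} W D^{-1/2}, obtained by scaling each entry W_ij with the
    scalar 1 / sqrt(d_i * d_j) instead of multiplying diagonal matrices."""
    d = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = -W * np.outer(inv_sqrt, inv_sqrt)
    np.fill_diagonal(L, 1.0)
    return L


def spectral_clustering(X, n_clusters=3, k=10, sigma=1.0):
    W = knn_similarity(X, k=k, sigma=sigma)
    L = normalized_laplacian(W)
    # approximate eigenvectors for the smallest eigenvalues via Lanczos
    # iteration; the paper's approximation scheme may differ
    _, U = eigsh(L, k=n_clusters, which='SA')
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.3, size=(60, 2)) for c in (0.0, 3.0, 6.0)])
    print(spectral_clustering(X, n_clusters=3))

The sketch keeps the diagonal-matrix products and dense pairwise loops out of the computation, which is the same idea the paper applies to Spark partitions; on a cluster, the pairwise similarities and the per-entry Laplacian scaling would be distributed across RDD partitions.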

Key words: large-scale spectral clustering, similarity matrix sparsification, one-way loop iteration, approximate eigenvector, distributed Spark parallel computing
