Parallel fuzzy C-means clustering algorithm in Spark

doi:10.11772/j.issn.1001-9081.2016.02.0342

Journal of Computer Applications, 2016, Vol. 36, Issue 2: 342-347. DOI: 10.11772/j.issn.1001-9081.2016.02.0342

The 3rd CCF Academic Conference on Big Data (CCF BigData 2015)

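WANG Guilan, ZHOU Guoliang, SA Churila, ZHU Yongli

Network and Information Management Center, North China Electric Power University, Baoding, Hebei 071003, China

Received: 2015-08-29  Revised: 2015-09-13  Publication date: 2016-02-10  Release date: 2016-02-03
Corresponding author: ZHOU Guoliang (born 1978), male, from Baoding, Hebei; associate professor, Ph.D.; main research interests: smart grid, online analytical processing.
About the authors: WANG Guilan (born 1979), female, from Baoding, Hebei; lecturer, Ph.D. candidate; main research interests: electric power big data analysis, wind turbine fault analysis. SA Churila (born 1992), male (Mongolian ethnicity), from Tongliao, Inner Mongolia; master's student; main research interests: cloud computing, data mining. ZHU Yongli (born 1963), male, from Hengshui, Hebei; professor, doctoral supervisor, Ph.D., CCF senior member; main research interests: artificial intelligence, power dispatching automation systems.
Funding: Fundamental Research Funds for the Central Universities (13MS103); Natural Science Foundation of Hebei Province (F2014502069).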

Abstract: Clustering algorithms must handle ever larger datasets under ever tighter timeliness requirements, which places higher demands on their adaptability to big data and on their performance. To address this, a fuzzy C-means (FCM) algorithm named Spark-FCM is proposed for the Spark distributed in-memory computing platform. First, matrices are partitioned horizontally into sets of vectors and stored in a distributed manner, with different vectors held on different nodes. Second, based on the computational characteristics of FCM, the common matrix operations, including multiplication, transposition and addition, are redesigned to be distributed and cache-sensitive. Finally, Spark-FCM is built on these matrix operations and the characteristics of the Spark platform; its main data structures use distributed matrix storage, so little data moves between nodes and every step is computed in a distributed manner. Tests in stand-alone and cluster environments show that Spark-FCM scales well and adapts to large-scale datasets: its performance is linear in the data volume, and performance in the cluster environment is 2 to 3 times higher than in the stand-alone environment.

Key words: Spark, fuzzy C-means (FCM), matrix computing, in-memory computing
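The abstract's two central ideas, horizontal partitioning of the data matrix and per-step distributed computation with little data movement, can be illustrated with a minimal single-process sketch. It is not the authors' Spark-FCM code and does not use Spark: row blocks in a Python list stand in for partitions held on different nodes, and the names `partition_rows`, `partial_sums` and `fcm` are hypothetical. It assumes the standard FCM update rules (fuzzifier m, squared Euclidean distance), which the abstract does not spell out. Each block computes its own contribution to the center update, so only c × (d + 1) numbers per block need to be combined, however many rows the block holds.

```python
import math
import random


def partition_rows(rows, num_parts):
    """Split a row-major matrix into at most num_parts contiguous row blocks."""
    size = math.ceil(len(rows) / num_parts)
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def memberships(x, centers, m):
    """Standard FCM membership of one point x in every cluster (sums to 1)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    zeros = [j for j, v in enumerate(d2) if v == 0.0]
    if zeros:
        # A point lying exactly on a center belongs fully to that center.
        return [1.0 / len(zeros) if j in zeros else 0.0
                for j in range(len(centers))]
    inv = [v ** (-1.0 / (m - 1.0)) for v in d2]
    total = sum(inv)
    return [v / total for v in inv]


def partial_sums(block, centers, m):
    """'Map' step for one row block (assumed to run where the block is stored).

    Returns the block's share of the center-update numerators (c x d)
    and denominators (c).
    """
    c, d = len(centers), len(centers[0])
    num = [[0.0] * d for _ in range(c)]
    den = [0.0] * c
    for x in block:
        u = memberships(x, centers, m)
        for j in range(c):
            w = u[j] ** m
            den[j] += w
            for k in range(d):
                num[j][k] += w * x[k]
    return num, den


def fcm(rows, c, m=2.0, num_parts=4, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy C-means over row blocks; returns the list of cluster centers."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, c)]  # initial centers from data
    blocks = partition_rows(rows, num_parts)
    d = len(centers[0])
    for _ in range(max_iter):
        parts = [partial_sums(b, centers, m) for b in blocks]  # map
        # 'Reduce' step: only the small per-block sums are combined.
        num = [[sum(p[0][j][k] for p in parts) for k in range(d)]
               for j in range(c)]
        den = [sum(p[1][j] for p in parts) for j in range(c)]
        new_centers = [[num[j][k] / den[j] for k in range(d)]
                       for j in range(c)]
        shift = max(abs(a - b)
                    for nc, oc in zip(new_centers, centers)
                    for a, b in zip(nc, oc))
        centers = new_centers
        if shift < tol:
            break
    return centers
```

Because the reduce step only sums per-block partial results, the centers should not depend on how the rows are split, up to floating-point rounding. On Spark, the per-block step would plausibly map onto `RDD.mapPartitions` followed by a reduce, though the abstract does not state whether Spark-FCM is organized that way.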

CLC number: TP393.027

WANG Guilan, ZHOU Guoliang, SA Churila, ZHU Yongli. Parallel fuzzy C-means clustering algorithm in Spark[J]. Journal of Computer Applications, 2016, 36(2): 342-347.
