Journal of Computer Applications, 2016, Vol. 36, Issue (2): 342-347. DOI: 10.11772/j.issn.1001-9081.2016.02.0342

• The 3rd CCF Academic Conference on Big Data (CCF BigData 2015)

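Parallel fuzzy C-means clustering algorithm in Spark
(Chinese title: Spark环境下的并行模糊C均值聚类算法)

WANG Guilan (王桂兰), ZHOU Guoliang (周国亮), SA Churila (萨初日拉), ZHU Yongli (朱永利)

  1. Network and Information Management Center, North China Electric Power University, Baoding Hebei 071003, China
  • Received: 2015-08-29 Revised: 2015-09-13 Online: 2016-02-10 Published: 2016-02-03
  • Corresponding author: ZHOU Guoliang (born 1978), male, from Baoding, Hebei; associate professor, Ph.D. Research interests: smart grid, online analytical processing.
  • About the authors: WANG Guilan (born 1979), female, from Baoding, Hebei; lecturer, Ph.D. candidate. Research interests: electric power big data analysis, wind turbine fault analysis. SA Churila (born 1992), male (Mongolian ethnicity), from Tongliao, Inner Mongolia; M.S. candidate. Research interests: cloud computing, data mining. ZHU Yongli (born 1963), male, from Hengshui, Hebei; professor, doctoral supervisor, Ph.D., CCF senior member. Research interests: artificial intelligence, power dispatching automation systems.
  • Supported by:
    Fundamental Research Funds for the Central Universities (13MS103); Natural Science Foundation of Hebei Province (F2014502069).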

Abstract: As the data sets handled by clustering algorithms grow and timeliness requirements tighten, the algorithms must adapt to big data and deliver higher performance. To address this, a fuzzy C-means (FCM) algorithm named Spark-FCM was proposed on the Spark distributed in-memory computing platform. First, each matrix was partitioned horizontally into a set of vectors and stored in a distributed manner, with different vectors stored on different nodes. Then, based on the computational characteristics of FCM, common matrix operations, including multiplication, transpose and addition, were designed to be distributed and cache-sensitive. Finally, Spark-FCM was built on these matrix operations and the features of the Spark platform. Its main data structures are stored as distributed matrices, so little data moves between nodes and every step is computed in a distributed manner. Tests in stand-alone and cluster environments show that Spark-FCM has good scalability and can handle large-scale data sets: its performance scales linearly with data volume, and performance in the cluster environment is 2 to 3 times higher than in the stand-alone environment.

Key words: Spark, Fuzzy C-Means (FCM), matrix computing, in-memory computing
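
To make the idea in the abstract concrete, the following is a minimal sketch of one way horizontally partitioned data can drive the standard FCM center update. It is not the authors' Spark-FCM implementation and does not reproduce their distributed matrix operations. It is plain Python standing in for a cluster, and the names partition_rows, memberships, partial_sums, merge and fcm are hypothetical. Each partition computes its own partial sums (roughly the role a per-partition map would play on Spark), and the partial results are then added together (the role of a reduce), so only small per-cluster summaries are exchanged rather than the data rows themselves.

```python
from functools import reduce


def partition_rows(rows, num_parts):
    """Horizontal partitioning: split the row vectors into num_parts chunks."""
    return [rows[i::num_parts] for i in range(num_parts)]


def memberships(x, centers, m):
    """Standard FCM membership of point x in each cluster (fuzzifier m > 1)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    zeros = [j for j, d in enumerate(d2) if d == 0.0]
    if zeros:
        # x coincides with one or more centers: share full membership among them.
        return [1.0 / len(zeros) if j in zeros else 0.0
                for j in range(len(centers))]
    inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
    total = sum(inv)
    return [v / total for v in inv]


def partial_sums(part, centers, m):
    """Map step: numerators and denominators of the center update for one partition."""
    c, dim = len(centers), len(centers[0])
    num = [[0.0] * dim for _ in range(c)]
    den = [0.0] * c
    for x in part:
        u = memberships(x, centers, m)
        for j in range(c):
            w = u[j] ** m
            den[j] += w
            for k in range(dim):
                num[j][k] += w * x[k]
    return num, den


def merge(a, b):
    """Reduce step: add two partial results element-wise."""
    num = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [p + q for p, q in zip(a[1], b[1])]
    return num, den


def fcm(rows, c, m=2.0, iters=100, tol=1e-9, num_parts=4):
    """Iterate the FCM center update over partitioned rows; returns the centers."""
    # Assumption for this sketch: a simple deterministic start from evenly spaced rows.
    centers = [list(rows[i * len(rows) // c]) for i in range(c)]
    parts = partition_rows(rows, num_parts)
    for _ in range(iters):
        num, den = reduce(merge, (partial_sums(p, centers, m) for p in parts))
        new = [[v / d for v in row] for row, d in zip(num, den)]
        shift = max(abs(a - b) for r, s in zip(new, centers) for a, b in zip(r, s))
        centers = new
        if shift < tol:
            break
    return centers
```

Because the partial sums are simply added, the number of partitions should affect only how the work is split, not the resulting centers (up to floating-point rounding). For example, fcm(rows, 2, num_parts=1) and fcm(rows, 2, num_parts=3) are expected to agree closely on the same input.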

CLC Number: