Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (9): 2505-2509. DOI: 10.11772/j.issn.1001-9081.2014.09.2505

• Network and communications •

Distributed clustering algorithm with high communication efficiency for streaming data

ZHU Qiang1, SUN Yuqiang2

  1. Educational Technology Center, Zhejiang University of Media and Communications, Hangzhou Zhejiang 310018, China
  2. School of Mathematics and Physics, Changzhou University, Changzhou Jiangsu 213164, China
  • Received: 2014-04-01 Revised: 2014-06-16 Online: 2014-09-01 Published: 2014-09-30
  • Contact: ZHU Qiang

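  • About the authors:
    ZHU Qiang, born in 1964, male, from Xinxiang, Henan; associate professor, M.S. Research interests: algorithm design, graphics and image processing, parallel grammar computation.
    SUN Yuqiang, born in 1956, male, from Zhengzhou, Henan; professor, Ph.D. Research interests: parallel algorithms, software engineering.
  • Supported by:

    Natural Science Foundation of Zhejiang Province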

Abstract:

Sensor nodes have limited resources, and high communication overhead consumes much of their power. To reduce the communication overhead of distributed clustering of streaming data, an efficient algorithm with two phases, online local clustering and offline global collaborative clustering, was proposed. In the online phase, each remote stream data source clusters its own data locally and sends the resulting clusters to the collaborative node in serialized form. The collaborative node then collects and analyzes the local clusters from all sources to obtain the global clusters. Experimental results show that, as the sliding window grows, the time spent sending data stays constant while the clustering time and the total time grow linearly, which means that the execution time of the algorithm is not affected by the sliding window size or the number of clusters. The accuracy of the proposed algorithm is close to that of the centralized algorithm, and its communication overhead is far lower than that of related distributed algorithms. These results indicate that the proposed algorithm scales well and can be applied to clustering analysis of large-scale distributed streaming data.
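The two-phase flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering method, the summary format, or the serialization scheme, so the weighted k-means, the (centroid, count) summaries, the JSON encoding, and all function names below (`weighted_kmeans`, `local_summary`, `global_clusters`, `make_stream`) are assumptions chosen for illustration. The point it shows is that each source transmits only a few cluster summaries rather than the raw points in its sliding window.

```python
import json
import random
from collections import deque


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(points, k, weights=None, iters=10):
    """Weighted Lloyd's k-means; returns (centroids, total weight per cluster)."""
    if weights is None:
        weights = [1.0] * len(points)
    # Deterministic farthest-first initialisation (an illustrative choice).
    centroids = [tuple(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(tuple(far))
    dim = len(points[0])
    mass = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        mass = [0.0] * k
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            mass[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(s / mass[j] for s in sums[j]) if mass[j] > 0 else centroids[j]
            for j in range(k)
        ]
    return centroids, mass


def local_summary(window, k):
    """Online phase (assumed form): cluster one window, serialize the summaries."""
    centroids, mass = weighted_kmeans(list(window), k)
    clusters = [{"centroid": list(c), "count": m}
                for c, m in zip(centroids, mass) if m > 0]
    return json.dumps(clusters)  # JSON is an assumed serialization format


def global_clusters(messages, k):
    """Offline phase (assumed form): merge local summaries at the collaborative node."""
    points, weights = [], []
    for msg in messages:
        for c in json.loads(msg):
            points.append(tuple(c["centroid"]))
            weights.append(c["count"])
    return weighted_kmeans(points, k, weights)


def make_stream(seed, n):
    """Hypothetical stream alternating between blobs near (0, 0) and (10, 10)."""
    rng = random.Random(seed)
    for i in range(n):
        centre = 0.0 if i % 2 == 0 else 10.0
        yield (rng.gauss(centre, 0.5), rng.gauss(centre, 0.5))


messages = []
for seed in (1, 2):                # two remote stream data sources
    window = deque(maxlen=50)      # sliding window holding the 50 newest points
    for point in make_stream(seed, 80):
        window.append(point)
    messages.append(local_summary(window, k=2))

centroids, sizes = global_clusters(messages, k=2)
print(centroids, sizes)
```

In this sketch each source sends one short message containing two summaries regardless of the window length, which is consistent with the abstract's observation that sending time stays constant as the window grows. Weighting each local centroid by its point count keeps the global centroids equal to the means of the underlying points whenever the local clusters are merged whole.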

CLC Number: