计算机应用 ›› 2019, Vol. 39 ›› Issue (4): 1027-1031.DOI: 10.11772/j.issn.1001-9081.2018081790

• 数据科学与技术 • 上一篇    下一篇

基于连通距离和连通强度的BIRCH改进算法

樊仲欣1, 王兴1, 苗春生2   

  1. 1. 南京信息工程大学 大气与环境实验教学中心, 南京 210044;
    2. 南京信大气象科技有限公司, 南京 210044
  • 收稿日期:2018-08-30 修回日期:2018-10-27 出版日期:2019-04-10 发布日期:2019-04-10
  • 通讯作者: 樊仲欣
  • 作者简介:樊仲欣(1981-),男,江苏南京人,工程师,硕士,主要研究方向:数据挖掘、神经网络;王兴(1983-),男,江苏泰州人,高级工程师,博士,主要研究方向:数据挖掘、神经网络;苗春生(1954-),男,江苏南京人,教授,博士生导师,主要研究方向:天气气候预测、气象信息工程与技术。

Improved BIRCH clustering algorithm based on connectivity distance and intensity

FAN Zhongxin1, WANG Xing1, MIAO Chunsheng2   

  1. 1. Experimental Teaching Center of Atmosphere and Environment, Nanjing University of Information Science & Technology, Nanjing Jiangsu 210044, China;
    2. Nanjing Xinda Meteorological Science and Technology Company Limited, Nanjing Jiangsu 210044, China
  • Received:2018-08-30 Revised:2018-10-27 Online:2019-04-10 Published:2019-04-10

摘要: 为解决利用层次方法的平衡迭代规约和聚类(BIRCH)算法聚类结果依赖于数据对象的添加顺序,且对非球状的簇聚类效果不好以及受簇直径阈值的限制每个簇只能包含数量相近的数据对象的问题,提出一种改进的BIRCH算法。该算法用描述数据对象个体间连通性的连通距离和连通强度阈值替代簇直径阈值,还将簇合并的步骤加入到聚类特征树的生成过程中。在自定义及iris、wine、pendigits数据集上的实验结果表明,该算法比多阈值BIRCH、密度改进BIRCH等现有改进算法的聚类准确率更高,尤其在大数据集上比密度改进BIRCH准确率提高6个百分点,耗时降低61%。说明该算法能够适用于在线实时增量数据,可以识别非球形簇和体积不均匀簇,具有去噪功能,且时间和空间复杂度明显降低。

关键词: 层次聚类, 在线算法, BIRCH, 聚类特征, 聚类特征树

Abstract: Focusing on the issues that clustering results of Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) depend on the adding order of data objects, BIRCH has poor clustering effect on non-convex clusters, and each cluster of BIRCH can only contain a similar number of data objects because of the cluster diameter threshold, an improved BIRCH algorithm was proposed. In this algorithm, the cluster diameter threshold was replaced by connectivity distance and intensity threshold which described the connectivity between the data objects, and cluster merging step was added into the generation of cluster feature tree. Experimental result on custom and iris, wine, pendigits datasets show that the proposed algorithm has higher clustering accuracy than the existing improved algorithms such as multi-threshold BIRCH and density-improved BIRCH; especially on large datasets, the proposed algorithm has accuracy increased by 6 percentage points and running time reduced by 61% compared to density-improved BIRCH. The proposed algorithm can be applied to online real-time incremental data processing and identify non-convex clusters and clusters with uneven volume, has denoising function and significantly reduces time-complexity and space-complexity.

Key words: hierarchical clustering, on-line algorithm, Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), Cluster Feature (CF), Cluster Feature Tree (CF Tree)

中图分类号: