计算机应用

• CCSCW 2008 & IIN 2008 会议论文 • 上一篇    下一篇

一种改进的BIRCH聚类算法

蒋盛益 李霞   

  1. 广东外语外贸大学 广东外语外贸大学
  • 收稿日期:2008-09-17 修回日期:1900-01-01 发布日期:2009-01-01 出版日期:2009-01-01
  • 通讯作者: 李霞

Improved BIRCH clustering algorithm

shen-yi Jiang Li Xia   

  • Received:2008-09-17 Revised:1900-01-01 Online:2009-01-01 Published:2009-01-01
  • Contact: 李霞 Li Xia

摘要: BIRCH算法是一种适应于大规模数据集的聚类算法,通过对所有叶节点设定统一阈值T来构建聚类特征(CF)树,并在各阶段采取不同的阈值来重建树,但没有给出一个合理设定阈值初值T及如何在各阶段提升阈值大小的具体方法。另外BIRCH算法只能处理数值型数据,这使其应用受到限制。针对以上不足,对BIRCH算法做了以下改进:1)改进原BIRCH算法的CF结构,使其可以处理混合型属性数据集; 2)启发式为BIRCH算法选择初始阈值T并给出了第二阶段提升阈值的具体操作方法; 3)对BIRCH算法的参数B和L做了探讨,指出当参数B=L时算法性能相近,并提出为获得较好聚类效果时B值的取值范围。实验结果表明,改进后的BIRCH算法具有较好的性能。

关键词: BIRCH算法, 聚类, 阈值, 混合属性数据, 数据挖掘

Abstract: BIRCH algorithm is a clustering algorithm suitable for very large data sets. In the algorithm, a CF-tree is built whose all entries in each leaf node must satisfy a uniform threshold T, and the CF-tree is rebuilt at each stage by different threshold. But how to set the initial threshold and how to increase the threshold of each stage are not given. In addition, the algorithm can only work with "metric" attribute, which makes its application restrained. This paper made some improvements on BIRCH algorithm: 1) Changed CF structure so that heterogeneous attributes could be manipulated; 2) Gave a heuristic method of getting initial threshold and increasing threshold of second stage of the algorithm; 3) Discussed the algorithm's parameter B and L and found that the algorithm had equal performance when B=L, at last, gave a sound scope for B. Experimental results on public data sets show that the improved algorithm has preferable performance.

Key words: BIRCH algorithm, clustering, threshold, heterogeneous attributes, data mining