Optimized clustering algorithm based on density of hierarchical division

Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (6): 1634-1638. DOI: 10.11772/j.issn.1001-9081.2016.06.1634

PANG Lin^1,2, LIU Fang'ai^1,2

1. College of Information Science and Engineering, Shandong Normal University, Jinan, Shandong 250014, China;
2. Shandong Provincial Key Laboratory for Distributed Computer Software Novel Technology, Jinan, Shandong 250014, China

Received: 2015-11-30 Revised: 2015-12-30 Online: 2016-06-10 Published: 2016-06-08
Corresponding author: LIU Fang'ai
About the authors: PANG Lin (born 1991), female, from Qingdao, Shandong, M.S. candidate, CCF member; main research interests: data mining and big data analysis. LIU Fang'ai (born 1962), male, from Qingdao, Shandong, professor, doctoral supervisor, Ph.D.; main research interests: wireless networks and distributed computing.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61572301, 90612003) and the Shandong Provincial Natural Science Foundation (ZR2013FM008).

Abstract

Abstract: Traditional clustering algorithms cluster the dataset repeatedly and are computationally inefficient on large datasets. To address this problem, an algorithm based on hierarchical division, named Clusters Optimization based on Density of Hierarchical Division (CODHD), was proposed to determine the optimal number of clusters and the initial cluster centers. Because the algorithm works on hierarchical divisions of the data, it does not need to cluster the dataset repeatedly. First, the dataset was scanned to obtain the statistics of all clustering features. Second, data partitions at different levels were generated bottom-up; the density of each data point in a partition was calculated, the point with the maximum density in each partition was taken as its center, and the minimum distance from that center to any point of higher density was calculated. The average of the summed products of center density and minimum distance was taken as the validity index, and a clustering quality curve over the different levels of division was built incrementally. Finally, the optimal number of clusters and the initial cluster centers were estimated from the partition corresponding to the extreme point of the curve. The experimental results show that, compared with Clusters Optimization on Preprocessing Stage (COPS), the proposed CODHD improved clustering accuracy by 30% and clustering efficiency by at least 14.24%. The proposed algorithm is feasible and practical.

Key words: clustering algorithm, hierarchical division, optimal cluster number, initial cluster center, clustering validity index
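
The validity index described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. It assumes a cutoff-kernel density (the number of other points within a hypothetical radius `d_c`), it assumes that the globally densest point, which has no denser point, uses its largest distance to any other point, and it assumes that the "extreme point" of the quality curve is its maximum over candidate partitions with at least two clusters. The function names and the `partitions` input (one label list per level) are illustrative. The sketch also recomputes distances for every partition, whereas the abstract describes an incremental construction from clustering-feature statistics.

```python
import numpy as np


def validity_index(X, labels, d_c):
    """Mean over clusters of (center density * distance to nearest denser point)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full Euclidean distance matrix between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Assumed density: number of other points closer than the cutoff d_c.
    rho = (dist < d_c).sum(axis=1) - 1
    scores, centers = [], []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        center = members[np.argmax(rho[members])]  # densest point of the cluster
        denser = np.flatnonzero(rho > rho[center])
        # Assumed convention: if no point is denser, use the largest distance.
        delta = dist[center, denser].min() if denser.size else dist[center].max()
        scores.append(rho[center] * delta)
        centers.append(int(center))
    return float(np.mean(scores)), centers


def best_partition(X, partitions, d_c):
    """Score each candidate partition and return the one with the highest index."""
    results = [validity_index(X, labels, d_c) for labels in partitions]
    curve = [score for score, _ in results]
    best = int(np.argmax(curve))  # assumed: the extreme point is the maximum
    k = len(set(partitions[best]))
    return k, results[best][1], curve


# Two well-separated groups; the second candidate splits the first group in two.
X = [[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
     [100, 100], [101, 100], [100, 101], [99, 100]]
partitions = [
    [0, 0, 0, 0, 0, 1, 1, 1, 1],  # k = 2
    [0, 0, 0, 2, 2, 1, 1, 1, 1],  # k = 3
]
k, centers, curve = best_partition(X, partitions, d_c=1.2)
print(k, centers)  # 2 [0, 5]
```

On this toy data the two-cluster partition scores higher, and the returned indices point at the densest point of each group, which could then serve as initial cluster centers.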

CLC number: TP301.6

Cite this article: PANG Lin, LIU Fang'ai. Optimized clustering algorithm based on density of hierarchical division[J]. Journal of Computer Applications, 2016, 36(6): 1634-1638.
