Improved K-means algorithm with aggregation distance coefficient

doi:10.11772/j.issn.1001-9081.2019030485

Journal of Computer Applications, 2019, Vol. 39, Issue (9): 2586-2590.

• Data science and technology •

WANG Qiaoling, QIAO Fei, JIANG Youhao

School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China

Received: 2019-03-25 Revised: 2019-05-09 Online: 2019-05-20 Published: 2019-09-10
Supported by:
This work is partially supported by the Major Program of National Natural Science Foundation of China (71690230, 71690234).

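Corresponding author: QIAO Fei
About the authors: WANG Qiaoling (born 1994), female, from Ningbo, Zhejiang, M.S. candidate; research interests: clustering algorithms, big data analysis. QIAO Fei (born 1967), female, from Xi'an, Shaanxi, professor, Ph.D.; research interests: big data analysis, complex manufacturing scheduling, intelligent production systems. JIANG Youhao (born 1976), male, from Zaozhuang, Shandong, Ph.D. candidate; research interests: big data analysis, intelligent production systems.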

Abstract

The traditional K-means algorithm selects its initial centers and the value of K randomly, which makes the clustering results unstable and lowers their precision. To address this, an improved K-means algorithm based on aggregation distance was proposed. Firstly, high-quality cluster centers were selected by means of the aggregation distance coefficient and used as the initial centers of the K-means algorithm. Secondly, the Davies-Bouldin Index (DBI) was introduced as the criterion function of the algorithm, and the clustering was updated iteratively until the criterion function converged, at which point the clustering was complete. The proposed algorithm provides good initial cluster centers and a good K value, avoiding the randomness of the clustering results. Clustering results on two-dimensional numerical simulation data show that the improved algorithm still maintains a good clustering effect when the number of data samples reaches 10000. On the two UCI standard datasets Iris and Seg, the improved algorithm raises the adjusted Rand coefficient by 83.7% and 71.0% respectively compared with the traditional algorithm, which verifies that it yields more accurate clustering results than the traditional algorithm.
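The abstract names the Davies-Bouldin Index as the criterion function but does not give its formula. As a rough illustration only, the sketch below computes the index in its standard textbook form (mean, over clusters, of the worst ratio of summed within-cluster scatter to between-centroid distance; lower is better). The function names `centroid` and `davies_bouldin` are hypothetical, Euclidean distance is an assumption, and the authors' exact variant may differ.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def davies_bouldin(points, labels):
    """Standard Davies-Bouldin Index of a labelled partition (lower is better).

    Assumes Euclidean distance and distinct cluster centroids; coincident
    centroids would cause a division by zero.
    """
    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("DBI needs at least two clusters")

    centers = {lab: centroid(pts) for lab, pts in clusters.items()}
    # Scatter: mean distance from each member to its own centroid.
    scatter = {
        lab: sum(math.dist(p, centers[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }

    # For every cluster take its worst (largest) similarity ratio, then average.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy data: two well-separated pairs of points.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # groups distant points together
dbi_good = davies_bouldin(points, good)
dbi_bad = davies_bouldin(points, bad)
```

On this toy data the compact partition scores lower than the mixed one, which is the property that lets the index serve as a stopping or selection criterion when comparing candidate clusterings.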

Key words: aggregation distance coefficient, cluster center, clustering evaluation index, Davies-Bouldin Index (DBI), data clustering

CLC Number:

TP301.6

WANG Qiaoling, QIAO Fei, JIANG Youhao. Improved K-means algorithm with aggregation distance coefficient[J]. Journal of Computer Applications, 2019, 39(9): 2586-2590.

