Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (9): 2586-2590. DOI: 10.11772/j.issn.1001-9081.2019030485

• Data Science and Technology •

Improved K-means algorithm with aggregation distance coefficient

WANG Qiaoling, QIAO Fei, JIANG Youhao

  1. School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
  • Received: 2019-03-25 Revised: 2019-05-09 Online: 2019-05-20 Published: 2019-09-10
  • Corresponding author: QIAO Fei
  • About the authors: WANG Qiaoling (born 1994), female, from Ningbo, Zhejiang, M.S. candidate; research interests: clustering algorithms, big data analysis. QIAO Fei (born 1967), female, from Xi'an, Shaanxi, professor, Ph.D.; research interests: big data analysis, complex manufacturing scheduling, intelligent production systems. JIANG Youhao (born 1976), male, from Zaozhuang, Shandong, Ph.D. candidate; research interests: big data analysis, intelligent production systems.
  • Supported by:

    This work is partially supported by the Major Program of National Natural Science Foundation of China (71690230, 71690234).

Abstract:

The traditional K-means algorithm selects its initial centers and the value of K at random, which makes clustering results unstable and imprecise. To address this, an improved K-means algorithm based on aggregation distance is proposed. First, high-quality initial cluster centers are selected using the aggregation distance coefficient and passed to the K-means algorithm. Second, the Davies-Bouldin Index (DBI) is introduced as the criterion function, and the clustering is updated iteratively until the criterion function converges, which completes the clustering. The proposed algorithm supplies good initial cluster centers and a value of K, avoiding the randomness of the clustering results. Clustering results on two-dimensional numerical simulation data show that the improved algorithm maintains good clustering quality when the number of samples reaches 10000. On the UCI standard datasets Iris and Seg, the improved algorithm raises the adjusted Rand index by 83.7% and 71.0%, respectively, over the traditional algorithm, which confirms that it produces more accurate clustering results.

Key words: aggregation distance coefficient, cluster center, clustering evaluation index, Davies-Bouldin Index (DBI), data clustering
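
The abstract names the Davies-Bouldin Index as the criterion function but does not give its formula. As an illustration only, the sketch below computes the standard DBI with Euclidean distance for a fixed assignment of points to clusters. The function name `davies_bouldin` and the toy data are assumptions made for this example. It is not the authors' implementation, and it does not cover the aggregation distance coefficient, which the abstract does not define.

```python
import math


def davies_bouldin(points, labels):
    """Standard Davies-Bouldin Index for a hard clustering (lower is better).

    Hypothetical helper for illustration; assumes Euclidean distance and
    that no two cluster centroids coincide.
    """
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("DBI is only defined for two or more clusters")

    # Centroid of each cluster: the coordinate-wise mean of its members.
    centroids = {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in clusters.items()
    }

    # Scatter of each cluster: mean distance from its members to its centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    # For each cluster, take the worst (largest) similarity ratio against
    # any other cluster, then average these worst cases over all clusters.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy data: two well-separated squares of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]   # one cluster per square
bad_labels = [0, 1, 0, 1, 0, 1, 0, 1]    # each cluster mixes both squares

print(davies_bouldin(points, good_labels))
print(davies_bouldin(points, bad_labels))
```

A compact, well-separated partition gives a small DBI, while a partition that mixes the two groups gives a much larger one. This is why the index can serve as a stopping or selection criterion when comparing candidate clusterings.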

CLC Number: