计算机应用 ›› 2018, Vol. 38 ›› Issue (10): 2850-2855.DOI: 10.11772/j.issn.1001-9081.2018040830

• 数据科学与技术 • 上一篇    下一篇

基于多维网格空间的改进K-means聚类算法

邵伦, 周新志, 赵成萍, 张旭   

  1. 四川大学 电子信息学院, 成都 610065
  • 收稿日期:2018-04-23 修回日期:2018-06-24 出版日期:2018-10-10 发布日期:2018-10-13
  • 通讯作者: 周新志
  • 作者简介:邵伦(1992-),男,湖北荆州人,硕士研究生,主要研究方向:智能控制、数据挖掘;周新志(1966-),男,四川德阳人,教授,博士,主要研究方向:人工智能、智能控制、分布式测控系统;赵成萍(1975-),女,山西太原人,副教授,博士,主要研究方向:模式识别、智能系统;张旭(1992-),男,安徽阜阳人,硕士研究生,主要研究方向:智能控制、数据挖掘。
  • 基金资助:
    国家973计划项目(2013CB328903-2)。

Improved K-means clustering algorithm based on multi-dimensional grid space

SHAO Lun, ZHOU Xinzhi, ZHAO Chengping, ZHANG Xu   

  1. College of Electronics and Information Engineering, Sichuan University, Chengdu Sichuan 610065, China
  • Received:2018-04-23 Revised:2018-06-24 Online:2018-10-10 Published:2018-10-13
  • Supported by:
    This work is partially supported by the National Basic Research Program (973 Program) of China (2013CB328903-2).

摘要: K-means算法是被广泛使用的一种聚类算法,传统的K-means算法中初始聚类中心的选择具有随机性,易使算法陷入局部最优,聚类结果不稳定。针对此问题,引入多维网格空间的思想,首先将样本集映射到一个虚拟的多维网格空间结构中,然后从中搜索出包含样本数最多且距离较远的子网格作为初始聚类中心网格,最后计算出各初始聚类中心网格中所包含样本的均值点来作为初始聚类中心。此法选择出来的初始聚类中心与实际聚类中心拟合度高,进而可据此初始聚类中心稳定高效地得到最终的聚类结果。通过使用计算机模拟数据集和UCI机器学习数据集进行测试,结果表明改进算法的迭代次数和错误率比较稳定,且均小于传统K-means算法测试结果的平均值,能有效避免陷入局部最优,并且聚类结果稳定。

关键词: K-means算法, 聚类算法, 初始聚类中心, 多维网格空间, 均值点

Abstract: K-means algorithm is a widely used clustering algorithm, but the selection of the initial clustering centers in the traditional K-means algorithm is random, which makes the algorithm easily fall into local optimum and causes instability in the clustering result. In order to solve this problem, the idea of multi-dimensional grid space was introduced to the selection of initial clustering center. Firstly, the sample set was mapped to a virtual multi-dimensional grid space structure. Secondly, the sub-grids containing the largest number of samples and being far away from each other were searched as the initial cluster center grids in the space structure. Finally, the mean points of the samples in the initial cluster center grids were calculated as the initial clustering centers. The initial clustering centers chosen by this method are very close to the actual clustering centers, so that the final clustering result can be obtained stably and efficiently. By using computer simulation data set and UCI machine learning data sets to test, both the iterative number and error rate of the improved algorithm are stable, and smaller than the average of the traditional K-means algorithm. The improved algorithm can effectively avoid falling into local optimum and guarantee the stability of clustering result.

Key words: K-means algorithm, clustering algorithm, initial clustering center, multi-dimensional grid space, mean point

中图分类号: