《计算机应用》唯一官方网站 ›› 2024, Vol. 44 ›› Issue (3): 831-841.DOI: 10.11772/j.issn.1001-9081.2023030351

• 先进计算 • 上一篇    下一篇

基于自适应布谷鸟优化特征选择的K-means聚类

孙林1(), 刘梦含2   

  1. 1.天津科技大学 人工智能学院,天津 300457
    2.河南师范大学 计算机与信息工程学院,河南 新乡 453007
  • 收稿日期:2023-04-03 修回日期:2023-06-05 接受日期:2023-06-14 发布日期:2023-10-10 出版日期:2024-03-10
  • 通讯作者: 孙林
  • 作者简介:刘梦含(1996—),女,河南周口人,硕士研究生,主要研究方向:数据挖掘。
  • 基金资助:
    国家自然科学基金资助项目(62076089)

K-means clustering based on adaptive cuckoo optimization feature selection

Lin SUN1(), Menghan LIU2   

  1. 1.College of Artificial Intelligence,Tianjin University of Science & Technology,Tianjin 300457,China
    2.College of Computer and Information Engineering,Henan Normal University,Xinxiang Henan 453007,China
  • Received:2023-04-03 Revised:2023-06-05 Accepted:2023-06-14 Online:2023-10-10 Published:2024-03-10
  • Contact: Lin SUN
  • About author:LIU Menghan, born in 1996, M. S. candidate. Her research interests include data mining.
  • Supported by:
    National Natural Science Foundation of China(62076089)

摘要:

K-means聚类算法随机确定初始聚类数目,而且原始数据集中含有大量的冗余特征会导致聚类时精度降低,而布谷鸟搜索(CS)算法存在收敛速度慢和局部搜索能力弱等问题,为此提出一种基于自适应布谷鸟优化特征选择的K-means聚类算法(DCFSK)。首先,为提升CS算法的搜索速度和精度,在莱维飞行阶段,设计了自适应步长因子;为调节CS算法全局搜索和局部搜索之间的平衡、加快CS算法的收敛,动态调整发现概率,进而提出改进的动态CS算法(IDCS),在IDCS的基础上构建了结合动态CS的特征选择算法(DCFS)。其次,为提升传统欧氏距离的计算精确度,设计同时考虑样本和特征对距离计算贡献程度的加权欧氏距离;为了确定最佳聚类数目的选取方法,依据改进的加权欧氏距离构造了加权簇内距离和簇间距离。最后,为克服传统K-means聚类目标函数仅考虑簇内的距离而未考虑簇间距离的缺陷,提出基于中位数的轮廓系数的目标函数,进而设计了DCFSK。实验结果表明,在10个基准测试函数上,IDCS的各项指标取得了较优的结果;相较于K-means、DBSCAN(Density-Based Spatial Clustering of Applications with Noise)等算法,在6个合成数据集与6个UCI数据集上,DCFSK的聚类效果最佳。

关键词: 布谷鸟搜索算法, K-means聚类, 欧氏距离, 特征选择, 轮廓系数

Abstract:

The initial cluster number of the K-means clustering algorithm is randomly determined, a large number of redundant features are contained in the original datasets, which will lead to the decrease of clustering accuracy, and Cuckoo Search (CS) algorithm has the disadvantages of low convergence speed and weak local search. To address these issues, a K-means clustering algorithm combined with Dynamic CS Feature Selection (DCFSK) was proposed. Firstly, an adaptive step size factor was designed during the Levy flight phase to improve the search speed and accuracy of the CS algorithm. Then, to adjust the balance between global search and local search, and accelerate the convergence of the CS algorithm, the discovery probability was dynamically adjusted. An Improved Dynamic CS algorithm (IDCS) was constructed, and then a Dynamic CS-based Feature Selection algorithm (DCFS) was built. Secondly, to improve the calculation accuracy of the traditional Euclidean distance, a weighted Euclidean distance was designed to simultaneously consider the contribution of samples and features to distance calculation. To determine the selection scheme of the optimal number of clusters, the weighted intra-cluster and inter-cluster distances were constructed based on the improved weighted Euclidean distance. Finally, to overcome the defect that the objective function of the traditional K-means clustering only considers the distance within the clusters and does not consider the distance between the clusters, a objective function based on the contour coefficient of median was proposed. Thus, a K-means clustering algorithm based on the adaptive cuckoo optimization feature selection was designed. Experimental results show that, on ten benchmark test functions, IDCS achieves the best metrics. Compared to algorithms such as K-means and DBSCAN (Density-Based Spatial Clustering of Applications with Noise), DCFSK achieves the best clustering effects on six synthetic datasets and six UCI datasets.

Key words: Cuckoo Search (CS) algorithm, K-means clustering, Euclidean distance, feature selection, contour coefficient

中图分类号: