Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (6): 1914-1921.DOI: 10.11772/j.issn.1001-9081.2021040551

• Data science and technology • Previous Articles    

Density peak clustering algorithm based on adaptive reachable distance

Man ZHANG(), Zhengjun ZHANG, Junqi FENG, Tao YAN   

  1. School of Science,Nanjing University of Science and Technology,Nanjing Jiangsu 210094,China
  • Received:2021-04-12 Revised:2021-07-22 Accepted:2021-08-05 Online:2022-06-22 Published:2022-06-10
  • Contact: Man ZHANG
  • About author:ZHANG Zhengjun, born in 1965, Ph. D., associate professor. His research interests include data mining, graphics technology, image processing.
    FENG Junqi, born in 1997, M. S. candidate. His research interests include machine learning, data mining.
    YAN Tao, born in 1977, Ph. D., associate professor. His research interests include linear and nonlinear programming, optimization models and algorithms in application problems, complementarity problems, programming with equilibrium constraints.
    First author contact:ZHANG Man, born in 1998, M. S. candidate. Her research interests include machine learning, data mining.
  • Supported by:
    National Natural Science Foundation of China(11671205)

基于自适应可达距离的密度峰值聚类算法

章曼(), 张正军, 冯俊淇, 严涛   

  1. 南京理工大学 理学院,南京 210094
  • 通讯作者: 章曼
  • 作者简介:张正军(1965—),男,江苏阜宁人,副教授,博士,主要研究方向:数据挖掘、图形技术、图像处理
    冯俊淇(1997—),男,辽宁沈阳人,硕士研究生,主要研究方向:机器学习、数据挖掘
    严涛(1977—),江苏泰兴人,副教授,博士,主要研究方向:线性与非线性规划、应用问题中的优化模型及算法、互补问题、均衡约束规划。
    第一联系人:章曼(1998—),女,安徽太湖人,硕士研究生,主要研究方向:机器学习、数据挖掘
  • 基金资助:
    国家自然科学基金资助项目(11671205)

Abstract:

Concerning the problem that the cutoff distance needs to be selected manually in Clustering by Fast Search and Find of Density Peaks (CFSFDP) algorithm, as well as the poor clustering effect on complex datasets with different density clusters due to the error caused by nearest neighbor assignment, a Density Peak Clustering algorithm based on Adaptive Reachable Distance (ARD-DPC) was proposed. In this algorithm, a non-parametric kernel density estimation method was used to calculate the local density of points, and the clustering centers were selected by the decision graph. Then, an adaptive reachable distance was used to assign the data points and obtain the final clustering result. Simulation experiments were conducted on 4 synthetic datasets and 6 UCI datasets, and the proposed algorithm was compared with CFSFDP (Clustering by Fast Search and Find of Density Peaks), DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and DADPC (Density Peaks Clustering based on Density Adaptive distance). Experimental results show that compared to the three other algorithms, the proposed ARD-DPC algorithm achieves the all highest Normalized Mutual Information (NMI), Rand Index (RI) and F1-measure on 4 synthetic datasets and 3 UCI datasets, the only highest NMI on UCI Breast dataset, the only highest F1-measure on UCI Heart dataset, but does not cluster UCI Pima dataset well, which has high fuzzyness and unclear clustering feature. At the same time, ARD-DPC algorithm can accurately identify the number of clusters and clusters with complex density on the synthetic datasets.

Key words: clustering algorithm, density peak, cutoff distance, non-parametric kernel density estimation, adaptive reachable distance

摘要:

针对基于快速搜索和发现密度峰值的聚类(CFSFDP)算法中截断距离需要人工选取,以及最近邻分配带来的误差导致的在具有不同密度簇的复杂数据集上的聚类效果不佳的问题,提出了一种基于自适应可达距离的密度峰值聚类(ARD-DPC)算法。该算法利用非参数核密度估计方法计算点的局部密度,根据决策图选取聚类中心,并利用自适应可达距离分配数据点,从而得到最终的聚类结果。在4个合成数据集和6个UCI数据集上进行了仿真实验,将所提算法ARD-DPC与基于快速搜索和发现密度峰值的聚类(CFSFDP)、基于密度的噪声应用空间聚类(DBSCAN)、基于密度自适应距离的密度峰聚类(DADPC)算法进行了比较,实验结果表明,相比其他三种算法,ARD-DPC算法在7个数据集上的标准化互信息(NMI)、兰德指数(RI)和F1-measure取得了最大值,在2个数据集分别取得F1-measure和NMI的最大值,只对模糊度较高、聚类特征不明显的Pima数据集聚类效果不佳;同时,ARD-DPC算法在合成数据集上能准确地识别出聚类数目和具有复杂密度的簇。

关键词: 聚类算法, 密度峰值, 截断距离, 非参数核密度估计, 自适应可达距离

CLC Number: