Journal of Computer Applications, 2022, Vol. 42, Issue (8): 2450-2460. DOI: 10.11772/j.issn.1001-9081.2021061083

• Data science and technology •

Varied density clustering algorithm based on border point detection

Yanwei CHEN1,2, Xingwang ZHAO1,2

  1. School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi 030006, China
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan, Shanxi 030006, China
  • Received: 2021-06-24 Revised: 2021-12-07 Accepted: 2021-12-17 Online: 2022-01-25 Published: 2022-08-10
  • Contact: Xingwang ZHAO
  • About the authors: CHEN Yanwei, born in 1996 in Weifang, Shandong, M.S. candidate, CCF member. His research interests include data mining and machine learning.
    ZHAO Xingwang, born in 1984 in Taigu, Shanxi, Ph.D., associate professor, CCF member. His research interests include data mining and machine learning.
  • Supported by:
    National Natural Science Foundation of China (62072293)

Abstract:

Density clustering algorithms are widely used because they are robust to noise and can find clusters of arbitrary shape. In practical applications, however, they often cluster poorly when the clusters in a dataset have uneven densities and the borders between clusters are hard to distinguish. To solve these problems, a Varied Density Clustering algorithm based on Border point Detection (VDCBD) was proposed. Firstly, the border points between varied-density clusters were identified with a given relative density measure to enhance the separability of adjacent clusters. Secondly, the points in the non-border area were clustered to find the core class structures of the dataset. Thirdly, the detected border points were allocated to the corresponding core class structures according to the principle of high-density neighbor allocation. Finally, the noise points in the dataset were identified based on the class structure information. The proposed algorithm was compared with K-means, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, the Density Peaks Clustering Algorithm (DPCA), the CLUstering based on Backbone (CLUB) algorithm, and the Border Peeling clustering (BP) algorithm on artificial datasets and UCI datasets. Experimental results show that the proposed algorithm can effectively handle uneven density distribution and indistinguishable borders, and outperforms the compared algorithms on the evaluation indicators Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), F-Measure (FM), and Accuracy (ACC); in the running-efficiency analysis, when the data size is relatively large, VDCBD runs more efficiently than the DPCA, CLUB and BP algorithms.

Key words: density clustering, relative density, varied density, border point detection, noise recognition
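
The four stages named in the abstract (border point detection, core clustering, border allocation, noise recognition) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption made for illustration rather than the authors' method: the density estimate (inverse mean distance to the k nearest neighbors), the relative density (a point's density divided by the mean density of its neighbors), the symmetric kNN-graph components used for core clustering, the small-cluster rule for noise, and the hypothetical names `vdcbd_sketch`, `k`, `border_ratio` and `min_size`.

```python
import numpy as np

def vdcbd_sketch(X, k=5, border_ratio=0.8, min_size=3):
    """Illustrative four-stage pipeline; returns labels, with -1 for noise.

    All formulas and parameters are assumptions, not taken from the paper.
    """
    n = len(X)
    # Pairwise Euclidean distances; exclude each point from its own neighbors.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    knn = np.argsort(dist, axis=1)[:, :k]          # nearest neighbor first
    mean_knn_dist = np.take_along_axis(dist, knn, axis=1).mean(axis=1)
    density = 1.0 / (mean_knn_dist + 1e-12)        # assumed density estimate

    # Stage 1: a point is a border point when it is clearly sparser
    # than its own neighborhood (assumed relative density measure).
    relative_density = density / density[knn].mean(axis=1)
    border = relative_density < border_ratio

    # Stage 2: cluster non-border points as connected components of the
    # symmetric kNN graph restricted to non-border points.
    adjacent = np.zeros((n, n), dtype=bool)
    adjacent[np.arange(n)[:, None], knn] = True
    adjacent |= adjacent.T
    labels = np.full(n, -1)
    next_label = 0
    for seed in np.flatnonzero(~border):
        if labels[seed] != -1:
            continue
        labels[seed] = next_label
        stack = [seed]
        while stack:
            p = stack.pop()
            for q in np.flatnonzero(adjacent[p]):
                if not border[q] and labels[q] == -1:
                    labels[q] = next_label
                    stack.append(q)
        next_label += 1

    # Stage 3: visit border points from dense to sparse and give each the
    # label of its nearest already-labeled neighbor of higher density.
    border_idx = np.flatnonzero(border)
    for p in border_idx[np.argsort(-density[border_idx])]:
        for q in knn[p]:
            if labels[q] != -1 and density[q] > density[p]:
                labels[p] = labels[q]
                break

    # Stage 4: unassigned points and members of very small clusters are
    # treated as noise (an assumed rule); surviving clusters are renumbered.
    ids, counts = np.unique(labels[labels != -1], return_counts=True)
    remap = {int(old): new for new, old in enumerate(ids[counts >= min_size])}
    return np.array([remap.get(int(label), -1) for label in labels])
```

Because the relative density compares each point only with its own neighborhood, the same `border_ratio` applies to a dense cluster and a sparse one, which is the intuition behind handling varied densities; calling `vdcbd_sketch(X)` on an `(n, d)` array returns one integer label per row.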
