Journal of Computer Applications, 2019, Vol. 39, Issue (2): 403-408. DOI: 10.11772/j.issn.1001-9081.2018061373

• Data Science and Technology •

Mixed density peaks clustering algorithm

WANG Jun1,2, ZHOU Kai1, CHENG Yong2

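  1. School of Computer & Software, Nanjing University of Information Science & Technology, Nanjing, Jiangsu 210044, China;
  2. Technology Industry Department, Nanjing University of Information Science & Technology, Nanjing, Jiangsu 210044, China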
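  • Received: 2018-07-02 Revised: 2018-08-24 Online: 2019-02-10 Published: 2019-02-15
  • Corresponding author: ZHOU Kai
  • About the authors: WANG Jun (b. 1970), male, from Tongling, Anhui; professor, Ph.D., CCF member; research interests: wireless sensor networks, big data. ZHOU Kai (b. 1993), male, from Lianyungang, Jiangsu; master's student; research interests: big data. CHENG Yong (b. 1980), male, from Chongqing; senior engineer, Ph.D., CCF member; research interests: wireless sensor networks, big data.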
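  • Supported by:
    National Natural Science Foundation of China (41875184, 61373064); the "Six Talent Peaks" Innovation Team Project of Jiangsu Province (TD-XYDXX-004); the CERNET Next Generation Internet Technology Innovation Project (NGII20170610, NGII20171204); the Open Fund of the Jiangsu Key Laboratory of Agricultural Meteorology (KYQ1309).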

Abstract: Clustering by fast search and find of Density Peaks (DP) is a recent density-based clustering algorithm. When a single cluster contains multiple density peaks, DP treats each peak as a potential cluster center, which makes it difficult to determine the correct number of clusters in a dataset. To address this problem, a mixed density peaks clustering algorithm, C-DP, is proposed. First, the density peak points are taken as initial cluster centers and the dataset is divided into sub-clusters. Then, following the hierarchical Clustering Using Representatives (CURE) algorithm, scattered representative points are selected from each sub-cluster, the clusters containing the closest pair of representative points are merged, and a contraction factor parameter is introduced to control the shape of the clusters. Simulation results show that C-DP clusters better than DP on four synthetic datasets. A comparison of the Rand Index on real datasets shows that C-DP outperforms DP by 2.32% on dataset S1 and by 1.13% on dataset 4k2_far. C-DP therefore improves clustering accuracy on datasets in which a single cluster contains multiple density peaks.
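
The two stages summarized in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the cutoff-kernel density, the choice of peaks by the largest product rho * delta, the farthest-point selection of representatives, and every name used here (`dp_subclusters`, `representatives`, `merge_subclusters`, `dc`, `n_peaks`, `c`, `alpha`) are assumptions that the abstract does not specify.

```python
import numpy as np

def dp_subclusters(X, dc, n_peaks):
    """Stage 1 (assumed DP variant): split X into sub-clusters around density peaks."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = (d < dc).sum(axis=1) - 1           # neighbours within dc, excluding self
    order = np.argsort(-rho, kind="stable")  # decreasing density, ties by index
    n = len(X)
    delta = np.empty(n)                      # distance to nearest denser point
    parent = np.full(n, -1)                  # index of that denser point
    delta[order[0]] = d[order[0]].max()      # convention for the densest point
    for rank in range(1, n):
        i, denser = order[rank], order[:rank]
        j = denser[np.argmin(d[i, denser])]
        delta[i], parent[i] = d[i, j], j
    # Assumption: peaks are the n_peaks points with the largest rho * delta.
    peaks = np.argsort(-(rho * delta), kind="stable")[:n_peaks]
    labels = np.full(n, -1)
    labels[peaks] = np.arange(len(peaks))
    for i in order:                          # denser points are labelled first
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels

def representatives(P, c, alpha):
    """Pick up to c scattered points of P, then shrink them toward the centroid."""
    mean = P.mean(axis=0)
    reps = [P[np.argmax(np.linalg.norm(P - mean, axis=1))]]
    while len(reps) < min(c, len(P)):
        # Farthest-point heuristic: next point maximizes distance to chosen reps.
        gap = np.linalg.norm(P[:, None] - np.array(reps)[None], axis=-1).min(axis=1)
        reps.append(P[np.argmax(gap)])
    R = np.array(reps)
    return R + alpha * (mean - R)            # alpha plays the contraction-factor role

def merge_subclusters(X, labels, k, c=4, alpha=0.3):
    """Stage 2 (CURE-style): merge the closest pair of clusters until k remain."""
    labels = labels.copy()
    while len(np.unique(labels)) > k:
        ids = np.unique(labels)
        reps = {a: representatives(X[labels == a], c, alpha) for a in ids}
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Cluster distance = smallest distance between representative pairs.
                dist = np.linalg.norm(reps[a][:, None] - reps[b][None], axis=-1).min()
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        labels[labels == best[2]] = best[1]
    return labels

# Toy data: two nearby 3x3 grids (one cluster, two peaks) and one distant grid.
offsets = np.array([(dx, dy) for dx in (-0.5, 0.0, 0.5) for dy in (-0.5, 0.0, 0.5)])
X = np.vstack([offsets + center for center in ([0.0, 0.0], [3.0, 0.0], [20.0, 0.0])])
sub = dp_subclusters(X, dc=0.8, n_peaks=3)
merged = merge_subclusters(X, sub, k=2)
print(len(np.unique(sub)), len(np.unique(merged)))
```

In this toy setup the two nearby grids stand in for a single cluster with two density peaks: stage 1 keeps them as separate sub-clusters, and stage 2 merges them while leaving the distant group on its own. A larger `alpha` pulls the representatives closer to each centroid, which makes the merge step less sensitive to outlying points.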

Key words: density peak, hierarchical clustering, class merging, representative point, contraction factor
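
For reference, the Rand Index cited in the abstract is the fraction of point pairs on which two labelings agree, that is, pairs placed together in both or apart in both. The pair-counting sketch below is only illustrative: `rand_index` is a hypothetical name, and the paper may compute the index differently (for example, from a contingency table).

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two labelings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += int(same_true == same_pred)
        total += 1
    return agree / total

# Splitting one true cluster in two (as extra density peaks would) lowers the score.
truth = [0, 0, 0, 0, 1, 1]
oversplit = [0, 0, 2, 2, 1, 1]
score = rand_index(truth, oversplit)
```

The index depends only on which points are grouped together, so renaming the cluster labels leaves it unchanged.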

CLC number: