Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (5): 1472-1479. DOI: 10.11772/j.issn.1001-9081.2021030515
Received: 2021-04-06
Revised: 2021-07-09
Accepted: 2021-07-14
Online: 2022-06-11
Published: 2022-05-10
Contact: Yan MA
About author: DU Jie, born in 1996 in Huzhou, Zhejiang, is an M.S. candidate. Her research interests include pattern recognition and image processing.
Supported by:
Abstract:
The Density Peaks Clustering (DPC) algorithm cannot select cluster centers accurately on datasets with varied densities and complex shapes, while the Local Gravitation Clustering (LGC) algorithm has many parameters that must be tuned manually. To address these problems, a clustering algorithm based on Local Gravity and Distance (LGDC) was proposed. Firstly, the local gravitation model was used to compute the centrality (CE) of each data point, and the distance between each data point and the points with higher CE was determined according to the CE values. Then, the points with both high CE and high distance were selected as cluster centers. Finally, the remaining points were assigned based on the idea that the CE of points inside a cluster is much higher than that of boundary points, and a balanced k-nearest-neighbor scheme was used to adjust the parameter automatically. Experimental results show that LGDC achieves better clustering results on four synthetic datasets; on real datasets such as Wine, SCADI and Soybean, the Adjusted Rand Index (ARI) of LGDC is 0.144 7 higher on average than that of algorithms such as DPC and LGC.
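The selection rule described above mirrors DPC's decision graph, but it is driven by a local-gravity-based centrality rather than by a density estimate. The following is a minimal sketch of that idea only, assuming a simple k-nearest-neighbor surrogate for CE and a fixed k; it does not reproduce the paper's exact CE formula, balanced k-nearest-neighbor tuning, or assignment step, and all names below are illustrative.

```python
# Illustrative sketch of the "high centrality + high distance" center selection
# described in the abstract. The centrality used here is only a k-NN surrogate,
# NOT the paper's local-gravity CE, and k is fixed rather than tuned by the
# balanced k-nearest-neighbor scheme mentioned in the abstract.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def candidate_centers(X, k=10, n_centers=3):
    n = X.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)                              # dist[:, 0] is the point itself
    centrality = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)   # tighter neighborhood -> higher score

    # DPC-style delta: distance to the nearest point with higher centrality.
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(centrality > centrality[i])[0]
        delta[i] = pairwise[i, higher].min() if higher.size else pairwise[i].max()

    # Points scoring high on both quantities are taken as candidate centers.
    gamma = centrality * delta
    return np.argsort(gamma)[::-1][:n_centers]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [3, 3], [0, 3])])
    print(candidate_centers(X, k=10, n_centers=3))
```

On well-separated blobs such a combined score already ranks the cluster cores first; according to the abstract, the paper's contribution lies in making the score reliable on datasets with varied densities and in removing the manual choice of k.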
CLC Number:
Jie DU, Yan MA, Hui HUANG. Clustering algorithm based on local gravity and distance [J]. Journal of Computer Applications, 2022, 42(5): 1472-1479.
| Dataset | Samples | Dimensions | Classes |
| --- | --- | --- | --- |
| DS1 | 788 | 2 | 7 |
| DS2 | 2 000 | 2 | 5 |
| DS3 | 1 390 | 2 | 4 |
| DS4 | 1 000 | 2 | 4 |
| Wine | 178 | 13 | 3 |
| SCADI | 70 | 205 | 7 |
| Soybean | 47 | 35 | 4 |
| Waveform | 5 000 | 21 | 3 |
| Libras | 360 | 90 | 15 |
| Statlog | 4 435 | 36 | 7 |

Tab. 1 Information of datasets
| Dataset | LGC: initial momentum | LGC: k | CLA: center threshold | CLA: k | LGDC: k |
| --- | --- | --- | --- | --- | --- |
| DS1 | 10 | 11 | 0.1 | 30 | 26 |
| DS2 | 30 | 10 | 0.1 | 60 | 40 |
| DS3 | 30 | 7 | 0.2 | 30 | 40 |
| DS4 | 30 | 15 | 0 | 60 | 30 |
| Wine | 10 | 5 | 0.5 | 14 | 14 |
| SCADI | 8 | 7 | 0.5 | 7 | 7 |
| Soybean | 5 | 5 | 0.5 | 10 | 5 |
| Waveform | 40 | 25 | 0.1 | 60 | 40 |
| Libras | 3 | 5 | 0.3 | 20 | 18 |
| Statlog | 10 | 23 | 0.5 | 60 | 40 |

Tab. 2 Parameter settings of different algorithms on different datasets
| Clustering algorithm | Metric | Wine | SCADI | Soybean | Waveform | Libras | Statlog |
| --- | --- | --- | --- | --- | --- | --- | --- |
| K-means | RI | 0.709 8 | 0.807 6 | 0.922 9 | 0.667 3 | 0.904 3 | 0.830 7 |
| | ARI | 0.364 0 | 0.431 9 | 0.808 3 | 0.253 6 | 0.295 6 | 0.462 6 |
| | FM | 0.587 7 | 0.559 0 | 0.863 3 | 0.503 9 | 0.349 1 | 0.568 3 |
| DPC | RI | 0.719 1 | 0.699 4 | 0.898 2 | 0.628 5 | 0.868 8 | 0.821 7 |
| | ARI | 0.371 5 | 0.189 1 | 0.725 1 | 0.189 7 | 0.261 6 | 0.479 1 |
| | FM | 0.587 7 | 0.388 2 | 0.792 7 | 0.476 7 | 0.345 8 | 0.595 3 |
| GDPC | RI | 0.716 2 | 0.696 9 | 0.898 2 | 0.614 3 | 0.150 0 | 0.635 0 |
| | ARI | 0.457 6 | 0.354 4 | 0.725 1 | 0.239 4 | 0.001 0 | 0.169 4 |
| | FM | 0.711 9 | 0.583 2 | 0.792 7 | 0.562 4 | 0.242 2 | 0.411 6 |
| LGC | RI | 0.718 5 | 0.754 5 | 0.842 7 | 0.333 3 | 0.757 4 | 0.754 7 |
| | ARI | 0.456 8 | 0.473 2 | 0.653 7 | 0.000 0 | 0.126 5 | 0.441 8 |
| | FM | 0.706 6 | 0.664 1 | 0.783 9 | 0.577 3 | 0.265 5 | 0.627 3 |
| CLA | RI | 0.713 5 | 0.840 6 | 0.842 7 | 0.333 3 | 0.760 7 | 0.693 2 |
| | ARI | 0.453 6 | 0.606 7 | 0.653 7 | 0.000 0 | 0.169 0 | 0.304 6 |
| | FM | 0.710 5 | 0.719 2 | 0.783 9 | 0.577 3 | 0.317 9 | 0.527 4 |
| LGDC | RI | 0.824 9 | 0.865 0 | 1.000 0 | 0.697 8 | 0.906 5 | 0.836 2 |
| | ARI | 0.613 2 | 0.657 2 | 1.000 0 | 0.341 7 | 0.344 6 | 0.502 0 |
| | FM | 0.747 1 | 0.750 2 | 1.000 0 | 0.575 8 | 0.399 5 | 0.606 3 |

Tab. 3 Result comparison of six clustering algorithms on real datasets
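The RI, ARI and FM values in Tab. 3 are standard external validation indices. A minimal sketch of how such scores can be computed with scikit-learn is shown below, assuming the Wine data shipped with scikit-learn and a k-means baseline in place of the paper's actual experimental pipeline.

```python
# Minimal sketch: computing the three external indices reported in Tab. 3
# (RI, ARI, FM) with scikit-learn; Wine + k-means stand in for the paper's setup.
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score, rand_score

X, y_true = load_wine(return_X_y=True)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(f"RI  = {rand_score(y_true, y_pred):.4f}")
print(f"ARI = {adjusted_rand_score(y_true, y_pred):.4f}")
print(f"FM  = {fowlkes_mallows_score(y_true, y_pred):.4f}")
```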