Clustering algorithm based on local gravity and distance

doi:10.11772/j.issn.1001-9081.2021030515

Abstract

Abstract:

The Density Peak Clustering （DPC） algorithm cannot accurately select the cluster centers for the datasets with various density and complex shape. The Clustering by Local Gravitation （LGC） algorithm has many parameters which need manual adjustment. To address these issues， a new Clustering algorithm based on Local Gravity and Distance （LGDC） was proposed. Firstly， the local gravity model was used to calculate the ConcEntration （CE） of data points， and the distance between each point and the point with higher CE value was determined according to CE. Then， the data points with high CE and high distance were selected as cluster centers. Finally， the remaining data points were allocated based on the idea that the CE of internal points of the cluster was much higher than that of the boundary points. At the same time， the balanced k nearest neighbor was used to adjust the parameters automatically. Experimental results show that， LGDC achieves better clustering effect on four synthetic datasets. Compared with algorithms such as DPC and LGC， LGDC has the index of Adjustable Rand Index （ARI） improved by 0.144 7 on average on the real datasets such as Wine， SCADI and Soybean.

Key words: density peak clustering, gravity clustering, local gravity model, concentration, distance

摘要：

密度峰值聚类（DPC）算法对于密度多样、形状复杂的数据集不能准确选择聚类中心，同时基于局部引力的聚类（LGC）算法参数较多且需要手动调参。针对这些问题，提出了一种基于局部引力和距离的聚类算法（LGDC）。首先，利用局部引力模型计算数据点的集中度（CE），根据集中度确定每个数据点与高集中度的点之间的距离；然后，选取具有高集中度值和高距离值的数据点作为聚类中心；最后，基于簇的内部点集中度远高于边界点的集中度的思想，分配其余数据点，并且利用平衡k近邻实现参数的自动调整。实验结果表明，LGDC在4个合成数据集上取得了更好的聚类效果；且在Wine、SCADI、Soybean等真实数据集上，LGDC的调整兰德系数（ARI）指标相较DPC、LGC等算法平均提高了0.144 7。

关键词: 密度峰值聚类, 引力聚类, 局部引力模型, 集中度, 距离

CLC Number:

TP181

Jie DU, Yan MA, Hui HUANG. Clustering algorithm based on local gravity and distance[J]. Journal of Computer Applications, 2022, 42(5): 1472-1479.

杜洁, 马燕, 黄慧. 基于局部引力和距离的聚类算法[J]. 《计算机应用》唯一官方网站, 2022, 42(5): 1472-1479.

Figures/Tables 10

References 26

1	WANG Y， MA Y， HUANG H. A neighborhood-based three-stage hierarchical clustering algorithm ［J］. Multimedia Tools and Applications， 2021， 80： 32379-32407. 10.1007/s11042-021-11171-w
2	LV X B， MA Y， HE X F， et al. CciMST： a clustering algorithm based on minimum spanning tree and cluster centers. mathematical problems in engineering ［J］. Mathematical Problems in Engineering， 2018， 2018： Article No.8451796. 10.1155/2018/8451796
3	XIE W B， LEE Y L， WANG C， et al. Hierarchical clustering supported by reciprocal nearest neighbors ［J］. Information Sciences， 2020， 527： 279-292. 10.1016/j.ins.2020.04.016
4	FELDMAN D， SCHMIDT M， SOHLER C. Turning big data into tiny data： constant-size core sets for k-means， PCA and projective clustering ［C］// Proceedings of the 2013 24th Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia： SIAM， 2013：1434-1453. 10.1137/1.9781611973105.103
5	MA Y， LIN H R， WANG Y， et al. A multi-stage hierarchical clustering algorithm based on centroid of tree and cut edge constraint ［J］. Information Sciences， 2021， 557： 194-219. 10.1016/j.ins.2020.12.016
6	YANG J， MA Y， ZHANG X F， et al. An initialization method based on hybrid distance for k-means algorithm ［J］. Neural Computation， 2017， 29（11）： 3094-3117. 10.1162/neco_a_01014
7	NG A Y， JORDAN M I， WEISS Y. On spectral clustering： analysis and an algorithm ［C］// Proceedings of the 2001 14th International Conference on Neural Information Processing Systems. Cambridge： MIT Press， 2001： 849-856.
8	SHEIKHOLESLAMI G， CHATTERJEE S， ZHANG A D. WaveCluster： a wavelet-based clustering approach for spatial data in very large databases ［J］. The VLDB Journal， 2000， 8（3/4）： 289-304. 10.1007/s007780050009
9	郭佳，韩李涛，孙宪龙，等.自动确定聚类中心的比较密度峰值聚类算法［J］.计算机应用，2021，41（3）：738-744.
	GUO J， HAN L T， SUN X L， et al. Comparative density peaks clustering algorithm with automatic determination of clustering center ［J］. Journal of Computer Applications， 2021， 41（3）： 738-744.
10	ESTER M， KRIEGEL H P， SANDER J， et al. A density-based algorithm for discovering clusters in large spatial databases with noise ［C］// Proceedings of the 1996 2nd International Conference on Knowledge Discovery and Data Mining. Palo Alto： AAAI Press， 1996： 226-231. 10.1109/icde.1998.655795
11	RODRIGUEZ A， LAIO A. Clustering by fast search and find of density peaks ［J］. Science， 2014， 344（6191）：1492-1496. 10.1126/science.1242072
12	温晓芳，杨志翀，陈梅.数据点的密度引力聚类新算法［J］.计算机科学与探索，2018，12（12）：1996-2006.
	WEN X F， YANG Z C， CHEN M. Density attraction clustering algorithm between data points ［J］. Journal of Frontiers of Computer Science and Technology， 2018， 12（12）： 1996-2006.
13	WANG Z Q， YU Z W， CHEN C L P， et al. Clustering by local gravitation ［J］. IEEE Transactions on Cybernetics， 2018， 48（5）： 1383-1396. 10.1109/tcyb.2017.2695218
14	LIU R， WANG H， YU X M. Shared-nearest-neighbor-based clustering by fast search and find of density peaks ［J］. Information Sciences， 2018， 450：200-226. 10.1016/j.ins.2018.03.031
15	吴斌，卢红丽，江惠君.自适应密度峰值聚类算法［J］.计算机应用，2020，40（6）：1654-1661.
	WU B， LU H L， JIANG H J. Adaptive density peaks clustering algorithm ［J］. Journal of Computer Applications， 2020， 40（6）： 1654-1661.
16	JIANG J H， HAO D H， CHEN Y J， et al. GDPC： Gravitation-based Density Peaks Clustering algorithm ［J］. Physica A： Statistical Mechanics and its Applications， 2018， 502： 345-355. 10.1016/j.physa.2018.02.084
17	WRIGHT W E. Gravitational clustering ［J］. Pattern Recognition， 1977， 9（3）： 151-166. 10.1016/0031-3203(77)90013-9
18	WANG X X， ZHANG Y F， XIE J， et al. A density-core-based clustering algorithm with local resultant force ［J］. Soft Computing， 2020， 24（9）：6571-6590. 10.1007/s00500-020-04777-z
19	BRYANT A， CIOS K. RNN-DBSCAN： a density-based clustering algorithm using reverse nearest neighbor density estimates ［J］. IEEE Transactions on Knowledge and Data Engineering， 2018， 30（6）： 1109-1121. 10.1109/tkde.2017.2787640
20	魏康园，何庆，徐钦帅.基于改进引力搜索算法的K-means聚类［J］.计算机应用研究，2019，36（11）：3240-3244.
	WEI K Y， HE Q， XU Q S. Novel K-means clustering algorithm based on improved gravitational search algorithm ［J］. Application Research of Computers， 2019， 36（11）： 3240-3244.
21	孙伟鹏.基于密度峰值聚类算法的研究与实现［D］.无锡：江南大学，2018：39-50.
	SUN W P. Research and implementation of density peaks clustering algorithm ［D］. Wuxi： Jiangnan University， 2018： 39-50.
22	DUA D， GRAFF C. UCI machine learning repository ［DS/OL］. ［2020-11-05］. .
23	YEUNG K Y， RUZZO W L. Principal component analysis for clustering gene expression data ［J］. Bioinformatics， 2001， 17（9）： 763-774. 10.1093/bioinformatics/17.9.763
24	HUANG Y P J， POWERS R， MONTELIONE G T. Protein NMR Recall， Precision and F-measure Scores （RPF Scores）： structure quality assessment measures based on information retrieval statistics ［J］. Journal of the American Chemical Society， 2005， 127（6）： 1665-1674. 10.1021/ja047109h
25	蒋礼青，张明新，郑金龙，等.快速搜索与发现密度峰值聚类算法的优化研究［J］.计算机应用研究，2016，33（11）：3251-3254. 10.1109/icalip.2016.7846664
	JIANG L Q， ZHANG M X， ZHENG J L， et al. Optimization of clustering by fast search and find of density peaks ［J］. Application Research of Computers， 2016， 33（11）： 3251- 3254. 10.1109/icalip.2016.7846664
26	MEHMOOD R， ZHANG G Z， BIE R F， et al. Clustering by fast search and find of density peaks via heat diffusion ［J］. Neurocomputing， 2016， 208： 210-217. 10.1016/j.neucom.2016.01.102

数据集	样本数	维度	类别数
DS1	788	2	7
DS2	2 000	2	5
DS3	1 390	2	4
DS4	1 000	2	4
Wine	178	13	3
SCADI	70	205	7
Soybean	47	35	4
Waveform	5 000	21	3
Libras	360	90	15
Statlog	4 435	36	7

数据集	样本数	维度	类别数
DS1	788	2	7
DS2	2 000	2	5
DS3	1 390	2	4
DS4	1 000	2	4
Wine	178	13	3
SCADI	70	205	7
Soybean	47	35	4
Waveform	5 000	21	3
Libras	360	90	15
Statlog	4 435	36	7

聚类算法	指标	Wine	SCADI	Soybean	Waveform	Libras	Statlog
K-means	RI	0.709 8	0.807 6	0.922 9	0.667 3	0.904 3	0.830 7
	ARI	0.364 0	0.431 9	0.808 3	0.253 6	0.295 6	0.462 6
	FM	0.587 7	0.559 0	0.863 3	0.503 9	0.349 1	0.568 3
DPC	RI	0.719 1	0.699 4	0.898 2	0.628 5	0.868 8	0.821 7
	ARI	0.371 5	0.189 1	0.725 1	0.189 7	0.261 6	0.479 1
	FM	0.587 7	0.388 2	0.792 7	0.476 7	0.345 8	0.595 3
GDPC	RI	0.716 2	0.696 9	0.898 2	0.614 3	0.150 0	0.635 0
	ARI	0.457 6	0.354 4	0.725 1	0.239 4	0.001 0	0.169 4
	FM	0.711 9	0.583 2	0.792 7	0.562 4	0.242 2	0.411 6
LGC	RI	0.718 5	0.754 5	0.842 7	0.333 3	0.757 4	0.754 7
	ARI	0.456 8	0.473 2	0.653 7	0.000 0	0.126 5	0.441 8
	FM	0.706 6	0.664 1	0.783 9	0.577 3	0.265 5	0.627 3
CLA	RI	0.713 5	0.840 6	0.842 7	0.333 3	0.760 7	0.693 2
	ARI	0.453 6	0.606 7	0.653 7	0.000 0	0.169 0	0.304 6
	FM	0.710 5	0.719 2	0.783 9	0.577 3	0.317 9	0.527 4
LGDC	RI	0.824 9	0.865 0	1.0000	0.697 8	0.906 5	0.836 2
	ARI	0.613 2	0.657 2	1.0000	0.341 7	0.344 6	0.502 0
	FM	0.747 1	0.750 2	1.0000	0.575 8	0.399 5	0.606 3

聚类算法	指标	Wine	SCADI	Soybean	Waveform	Libras	Statlog
K-means	RI	0.709 8	0.807 6	0.922 9	0.667 3	0.904 3	0.830 7
	ARI	0.364 0	0.431 9	0.808 3	0.253 6	0.295 6	0.462 6
	FM	0.587 7	0.559 0	0.863 3	0.503 9	0.349 1	0.568 3
DPC	RI	0.719 1	0.699 4	0.898 2	0.628 5	0.868 8	0.821 7
	ARI	0.371 5	0.189 1	0.725 1	0.189 7	0.261 6	0.479 1
	FM	0.587 7	0.388 2	0.792 7	0.476 7	0.345 8	0.595 3
GDPC	RI	0.716 2	0.696 9	0.898 2	0.614 3	0.150 0	0.635 0
	ARI	0.457 6	0.354 4	0.725 1	0.239 4	0.001 0	0.169 4
	FM	0.711 9	0.583 2	0.792 7	0.562 4	0.242 2	0.411 6
LGC	RI	0.718 5	0.754 5	0.842 7	0.333 3	0.757 4	0.754 7
	ARI	0.456 8	0.473 2	0.653 7	0.000 0	0.126 5	0.441 8
	FM	0.706 6	0.664 1	0.783 9	0.577 3	0.265 5	0.627 3
CLA	RI	0.713 5	0.840 6	0.842 7	0.333 3	0.760 7	0.693 2
	ARI	0.453 6	0.606 7	0.653 7	0.000 0	0.169 0	0.304 6
	FM	0.710 5	0.719 2	0.783 9	0.577 3	0.317 9	0.527 4
LGDC	RI	0.824 9	0.865 0	1.0000	0.697 8	0.906 5	0.836 2
	ARI	0.613 2	0.657 2	1.0000	0.341 7	0.344 6	0.502 0
	FM	0.747 1	0.750 2	1.0000	0.575 8	0.399 5	0.606 3

[1]	Huanhuan ZHOU, Bochuan ZHENG, Zheng ZHANG, Qi ZHANG. Density peak clustering algorithm based on adaptive nearest neighbor parameters [J]. Journal of Computer Applications, 2022, 42(5): 1464-1471.
[2]	Xueqin CHEN, Tao TAO, Zhongwang ZHANG, Yilei WANG. Long short-term session-based recommendation algorithm combining paired coding scheme and two-dimensional conventional neural network [J]. Journal of Computer Applications, 2022, 42(5): 1347-1354.
[3]	Jingyu LIU, Qiuxia NIU, Xiaoyan LI, Qiaoshuo SHI, Youxi WU. Fast failure recovery method based on local redundant hybrid code [J]. Journal of Computer Applications, 2022, 42(4): 1244-1252.
[4]	Yuchang YIN, Hongyuan WANG, Li CHEN, Zundeng FENG, Yu XIAO. One-shot video-based person re-identification with multi-loss learning and joint metric [J]. Journal of Computer Applications, 2022, 42(3): 764-769.
[5]	Yajie HUANG, Junhai ZHAI, Xiang ZHOU, Yan LI. Voting instance selection algorithm based on learning to hash [J]. Journal of Computer Applications, 2022, 42(2): 389-394.
[6]	Yaming LI, Kai XING, Hongwu DENG, Zhiyong WANG, Xuan HU. Derivative-free few-shot learning based performance optimization method of pre-trained models with convolution structure [J]. Journal of Computer Applications, 2022, 42(2): 365-374.
[7]	Yaoming MA, Yu ZHANG. Insulator detection algorithm based on improved Faster-RCNN [J]. Journal of Computer Applications, 2022, 42(2): 631-637.
[8]	YANG Su, OUYANG Zhi, DU Nisuo. Unsupervised parallel hash image retrieval based on correlation distance [J]. Journal of Computer Applications, 2021, 41(7): 1902-1907.
[9]	ZHANG Hao, ZHU Rui, SONG Fuyao, FANG Peng, XIA Xiufeng. Bichromatic reverse k nearest neighbor query method based on distance-keyword similarity constraint [J]. Journal of Computer Applications, 2021, 41(6): 1686-1693.
[10]	GUO Maozu, YANG Qiannan, ZHAO Lingling. Image generation based on conditional-Wassertein generative adversarial network [J]. Journal of Computer Applications, 2021, 41(5): 1432-1437.
[11]	GAN Lan, SHEN Hongfei, WANG Yao, ZHANG Yuejin. Data augmentation method based on improved deep convolutional generative adversarial networks [J]. Journal of Computer Applications, 2021, 41(5): 1305-1313.
[12]	WANG Ning, SONG Huihui, ZHANG Kaihua. Accurate object tracking algorithm based on distance weighting overlap prediction and ellipse fitting optimization [J]. Journal of Computer Applications, 2021, 41(4): 1100-1105.
[13]	PENG Li, ZHANG Haiqing, LI Daiwei, TANG Dan, YU Xi, HE Lei. Imputation algorithm for hybrid information system of incomplete data analysis approach based on rough set theory [J]. Journal of Computer Applications, 2021, 41(3): 677-685.
[14]	LYU Jia, XIAN Yan. Co-training algorithm combining improved density peak clustering and shared subspace [J]. Journal of Computer Applications, 2021, 41(3): 686-693.
[15]	WEI Bo, YANG Rong, SHU Sihao, WAN Yong, MIAO Jianguo. Path planning of mobile robots based on ion motion-artificial bee colony algorithm [J]. Journal of Computer Applications, 2021, 41(2): 379-383.