融合转移概率矩阵的多阶最近邻图聚类算法

doi:10.11772/j.issn.1001-9081.2023050727

《计算机应用》唯一官方网站 ›› 2024, Vol. 44 ›› Issue (5): 1527-1538.DOI: 10.11772/j.issn.1001-9081.2023050727

• 数据科学与技术 • 上一篇

融合转移概率矩阵的多阶最近邻图聚类算法

徐童童¹, 解滨¹^,²^,³(), 张春昊¹, 张喜梅¹

^1.河北师范大学计算机与网络空间安全学院, 石家庄 050024
^2.河北省网络与信息安全重点实验室(河北师范大学), 石家庄 050024
^3.供应链大数据分析与数据安全河北省工程研究中心(河北师范大学), 石家庄 050024

收稿日期:2023-06-06 修回日期:2023-07-09 接受日期:2023-07-14 发布日期:2023-08-01 出版日期:2024-05-10
通讯作者: 解滨
作者简介:徐童童（1998—），女，河北衡水人，硕士研究生，CCF会员，主要研究方向：机器学习、网络安全
张春昊（1996—），男，河北沧州人，硕士研究生，CCF会员，主要研究方向：机器学习
张喜梅（1998—），女，湖南益阳人，硕士研究生，CCF会员，主要研究方向：机器学习、网络安全。
第一联系人：解滨（1976—），男，吉林通化人，教授，博士，CCF会员，主要研究方向：粒计算、机器学习、近似推理
基金资助:
国家自然科学基金资助项目(62076088);河北师范大学研究生创新项目(XCXZZSS202317)

Multi-order nearest neighbor graph clustering algorithm by fusing transition probability matrix

Tongtong XU¹, Bin XIE¹^,²^,³(), Chunhao ZHANG¹, Ximei ZHANG¹

^1.College of Computer and Cyber Security，Hebei Normal University，Shijiazhuang Hebei 050024，China
^2.Hebei Provincial Key Laboratory of Network and Information Security （Hebei Normal University），Shijiazhuang Hebei 050024，China
^3.Hebei Provincial Engineering Research Center for Supply Chain Big Data Analytics and Data Security （Hebei Normal University），Shijiazhuang Hebei 050024，China

Received:2023-06-06 Revised:2023-07-09 Accepted:2023-07-14 Online:2023-08-01 Published:2024-05-10
Contact: Bin XIE
About author:XU Tongtong， born in 1998， M. S. candidate. Her research interests include machine learning， cyber security.
ZHANG Chunhao， born in 1996， M. S. candidate. His research interests include machine learning.
ZHANG Ximei， born in 1998， M. S. candidate. Her research interests include machine learning， cyber security.
Supported by:
National Natural Science Foundation of China(62076088);Graduate Innovation Project of Hebei Normal University(XCXZZSS202317)

摘要/Abstract

摘要：

聚类是根据样本之间的相似性将数据集划分为多个类簇。现有的大多数聚类方法都存在两个挑战：一方面，在定义样本间相似性时往往没有考虑样本的空间分布结构，无法构建稳定的相似度矩阵；另一方面，图聚类构造的样本图结构过于复杂，计算成本较高。为解决这两个问题，提出融合转移概率矩阵的多阶最近邻图聚类算法（MNNGC）。首先，综合样本的近邻关系和空间分布结构，将共享近邻定义的相似度进行趋密性加权，得到节点间的趋密性亲和矩阵；其次，利用节点间多阶概率转移预测非邻接点的关联程度，并通过融合多阶转移概率矩阵得到稳定的节点间亲和矩阵；再次，为进一步增强图局部结构，重新构建节点的多阶最近邻图，并对多阶最近邻图的局部结构分层聚类；最后，优化了边缘点分配策略。定位实验结果表明，MNNGC在合成数据集上的准确率（Acc）均优于对比算法，且在8个UCI数据集上的Acc为最大值。其中在Compound数据集上，MNNGC的Acc、调整互信息（AMI）、调整兰德指数（ARI）和FM指数（FMI）相较于基于局部密度峰值的谱聚类（LDP-SC）算法分别提高38.6、27.2、45.4、35.1个百分点。

关键词: 共享近邻, 趋密性, 转移概率, 多阶最近邻, 分层聚类

Abstract:

Clustering is to divide a dataset into multiple clusters based on the similarity between samples. Most existing clustering methods face two challenges. On the one hand， when defining the similarity between samples， the spatial distribution structure of the samples is often not considered， making it difficult to construct a stable similarity matrix. On the other hand， the sample graph structure constructed by graph clustering is too complex and has high computational costs. To solve these two problems， a Multi-order Nearest Neighbor Graph Clustering algorithm by fusing transition probability matrix （MNNGC） was proposed. Firstly， the nearest neighbor relationship and spatial distribution structure of samples were comprehensively considered， the similarity defined by shared nearest neighbor was weighted for densification， and the densification affinity matrix between nodes was obtained. Secondly， by utilizing multi-order probability transition between nodes， the correlation degrees of non-adjacent nodes were predicted， and a stable inter-node affinity matrix was obtained by fusing the multi-order transition probability matrix. Then， to further enhance the local structure of the graph， the multi-order nearest neighbor graph of nodes was reconstructed， and hierarchically clustered. Finally， the edge node allocation strategy was optimized. Positioning experimental results show that MNNGC achieves the highest Accuracy （Acc） among comparison clustering algorithms on all the synthetic datasets and 8 UCI datasets. The Acc， Adjusted Mutual Information （AMI）， Adjusted Rand Index （ARI） and Fowlkes and Mallows Index （FMI） of MNNGC algorithm are improved by 38.6， 27.2， 45.4 and 35.1 percentage points， respectively， compared with Local Density Peaks-based Spectral Clustering （LDP-SC） algorithm.

Key words: shared neighbor, densification, transition probability, multi-order nearest neighbor, hierarchical clustering

中图分类号:

TP181

徐童童, 解滨, 张春昊, 张喜梅. 融合转移概率矩阵的多阶最近邻图聚类算法[J]. 计算机应用, 2024, 44(5): 1527-1538.

Tongtong XU, Bin XIE, Chunhao ZHANG, Ximei ZHANG. Multi-order nearest neighbor graph clustering algorithm by fusing transition probability matrix[J]. Journal of Computer Applications, 2024, 44(5): 1527-1538.

图/表 18

图 1 10个样本的互为k近邻图

Fig. 1 Mutual k-nearest neighbor graph of ten samples

图 2 连通子图中密度最高点之间连接

Fig. 2 Connections between highest density nodes in connected subgraphs

图 3 Compound数据集图结构可视化

Fig. 3 Graph structure visualization for Compound dataset

表1 不同算法的时间复杂度比较

Tab.1 Time complexity comparison among different algorithms

算法	时间复杂度	算法	时间复杂度
K-means^［12］	$O (n)$	SC^［17］	$O (n 3)$
DPC^［14］	$O (n 2)$	CTCEHC^［20］	$O (n 2)$
SNN-DPC^［15］	$O (n 2)$	LDP-SC^［21］	$O (n 2 + p 3)$
AGNES^［16］	$O (n 3)$	MNNGC	$O (o t n 2 + p 3)$

表1 不同算法的时间复杂度比较

Tab.1 Time complexity comparison among different algorithms

算法	时间复杂度	算法	时间复杂度
K-means^［12］	$O (n)$	SC^［17］	$O (n 3)$
DPC^［14］	$O (n 2)$	CTCEHC^［20］	$O (n 2)$
SNN-DPC^［15］	$O (n 2)$	LDP-SC^［21］	$O (n 2 + p 3)$
AGNES^［16］	$O (n 3)$	MNNGC	$O (o t n 2 + p 3)$

表2 合成数据集与UCI数据集详细信息

Tab. 2 Detailed information of synthetic and UCI datasets

数据集			实例数	属性数	类数
	名称	简称	实例数	属性数	类数
合成数据集	Jain^［15］	s1	373	2	2
	Spiral^［28］	s2	312	2	3
	R15^［29］	s3	600	2	15
	4k2_far^［30］	s4	400	2	4
	Circle^［21］	s5	299	2	3
	Happy^［21］	s6	266	2	3
	Compound^［14］	s7	399	2	6
	PathBased^［14］	s8	300	2	3
UCI 数据集	Glass	u1	214	9	6
	Ionosphere	u2	351	33	2
	Knowledge	u3	172	5	4
	Planning	u4	182	12	2
	Flags	u5	194	28	8
	Haberman-normal	u6	306	3	2
	Balance-scale	u7	625	4	3
	Bupa	u8	345	6	2
	Breast-cancer	u9	286	9	2
	Ecoli	u10	336	7	8

表3 各算法参数设置

Tab. 3 Parameter settings for each algorithm

算法	参数	取值范围	算法	参数	取值范围
AGNES	c	—	SC	c	—
DPC	$d c$	（0.5%，2%）	CTCEHC	c	—
K-means	c	随机选取	LDP-SC	k/c	（3，35）/—
SNN-DPC	k	（3，35）	MNNGC	k/c	（3，35）/—

表3 各算法参数设置

Tab. 3 Parameter settings for each algorithm

算法	参数	取值范围	算法	参数	取值范围
AGNES	c	—	SC	c	—
DPC	$d c$	（0.5%，2%）	CTCEHC	c	—
K-means	c	随机选取	LDP-SC	k/c	（3，35）/—
SNN-DPC	k	（3，35）	MNNGC	k/c	（3，35）/—

图4 各对比算法在Jain数据集上的聚类结果

Fig. 4 Clustering results of each comparison algorithm on Jain dataset

图5 各对比算法在R15数据集上的聚类结果

Fig. 5 Clustering results of each comparison algorithm on R15 dataset

图6 各对比算法在Spiral数据集上的聚类结果

Fig. 6 Clustering results of each comparison algorithm on Spiral dataset

图7 各对比算法在4k2_far数据集上的聚类结果

Fig. 7 Clustering results of each comparison algorithm on 4k2_far dataset

图8 各对比算法在Circle数据集上的聚类结果

Fig. 8 Clustering results of each comparison algorithm on Circle dataset

图9 各对比算法在Happy数据集上的聚类结果

Fig. 9 Clustering results of each comparison algorithm on Happy dataset

图10 各对比算法在Compound数据集上的聚类结果

Fig. 10 Clustering results of each comparison algorithm on Compound dataset

图11 各对比算法在PathBased数据集上的聚类结果

Fig. 11 Clustering results of each comparison algorithm on PathBased dataset

表4 合成数据集上各聚类算法性能比较

Tab. 4 Performance comparison of clustering algorithms on synthetic datasets

算法	Jain							R15							Spiral
算法	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI
AGNES	—	2	—	0.946	0.696	0.779	0.922	—	15	—	0.995	0.992	0.989	0.990	—	3	—	0.369	-0.003	-0.002	0.357
DPC	2%	—	—	0.740	0.618	0.716	0.882	2%	—	—	0.997	0.994	0.993	0.993	2%	—	—	1.000	1.000	1.000	1.000
K‑means	—	2	—	0.786	0.368	0.324	0.701	—	15	—	0.927	0.956	0.898	0.906	—	3	—	0.349	-0.005	-0.006	0.328
SNN‑DPC	—	—	12	1.000	1.000	1.000	1.000	—	—	10	0.997	0.994	0.993	0.993	—	—	8	1.000	1.000	1.000	1.000
SC	—	2	—	1.000	1.000	1.000	1.000	—	15	—	0.996	0.994	0.993	0.993	—	3	—	1.000	1.000	1.000	1.000
CTCEHC	—	2	—	1.000	1.000	1.000	1.000	—	15	—	0.980	0.972	0.959	0.961	—	3	—	0.676	0.440	0.270	0.573
LDP‑SC	—	2	12	1.000	1.000	1.000	1.000	—	15	12	0.997	0.994	0.993	0.993	—	3	8	1.000	1.000	1.000	1.000
MNNGC	—	2	8	1.000	1.000	1.000	1.000	—	15	3	0.997	0.988	0.987	0.988	—	3	8	1.000	1.000	1.000	1.000
算法	4k2_far							Circle							Happy
算法	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI
AGNES	—	4	—	1.000	1.000	1.000	1.000	—	3	—	0.649	0.297	0.164	0.589	—	3	—	0.748	0.554	0.429	0.633
DPC	2%	—	—	1.000	1.000	1.000	1.000	2%	—	—	0.452	0.138	0.049	0.466	2%	—	—	0.797	0.584	0.488	0.665
K‑means	—	4	—	0.873	0.783	0.624	0.744	—	3	—	0.482	0.157	0.053	0.403	—	3	—	0.793	0.580	0.481	0.661
SNN‑DPC	—	—	17	0.998	0.990	0.997	0.998	—	—	33	1.000	1.000	1.000	1.000	—	—	10	1.000	1.000	1.000	1.000
SC	—	4	—	1.000	1.000	1.000	1.000	—	3	—	0.632	0.332	0.199	0.599	—	3	—	0.828	0.568	0.462	0.650
CTCEHC	—	4	—	1.000	1.000	1.000	1.000	—	3	—	0.692	0.378	0.253	0.619	—	3	—	0.771	0.558	0.445	0.640
LDP‑SC	—	4	12	1.000	1.000	1.000	1.000	—	3	8	1.000	1.000	1.000	1.000	—	3	8	1.000	1.000	1.000	1.000
MNNGC	—	4	6	1.000	1.000	1.000	1.000	—	3	8	1.000	1.000	1.000	1.000	—	3	8	1.000	1.000	1.000	1.000
算法	Compound							PathBased
算法	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI
AGNES	—	6	—	0.875	0.831	0.803	0.862	—	3	—	0.730	0.518	0.444	0.653
DPC	2%	—	—	0.832	0.857	0.783	0.855	2%	—	—	0.733	0.536	0.453	0.659
K‑means	—	6	—	0.832	0.715	0.509	0.621	—	3	—	0.743	0.543	0.462	0.662
SNN‑DPC	—	—	17	0.870	0.897	0.850	0.896	—	—	12	0.833	0.591	0.575	0.720
SC	—	6	—	0.809	0.763	0.531	0.640	—	3	—	0.808	0.765	0.684	0.792
CTCEHC	—	6	—	0.729	0.643	0.448	0.572	—	3	—	0.560	0.337	0.150	0.540
LDP‑SC	—	6	12	0.614	0.728	0.546	0.649	—	3	8	0.797	0.594	0.528	0.695
MNNGC	—	6	9	1.000	1.000	1.000	1.000	—	3	5	0.893	0.758	0.729	0.825

表4 合成数据集上各聚类算法性能比较

Tab. 4 Performance comparison of clustering algorithms on synthetic datasets

算法	Jain							R15							Spiral
算法	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI
AGNES	—	2	—	0.946	0.696	0.779	0.922	—	15	—	0.995	0.992	0.989	0.990	—	3	—	0.369	-0.003	-0.002	0.357
DPC	2%	—	—	0.740	0.618	0.716	0.882	2%	—	—	0.997	0.994	0.993	0.993	2%	—	—	1.000	1.000	1.000	1.000
K‑means	—	2	—	0.786	0.368	0.324	0.701	—	15	—	0.927	0.956	0.898	0.906	—	3	—	0.349	-0.005	-0.006	0.328
SNN‑DPC	—	—	12	1.000	1.000	1.000	1.000	—	—	10	0.997	0.994	0.993	0.993	—	—	8	1.000	1.000	1.000	1.000
SC	—	2	—	1.000	1.000	1.000	1.000	—	15	—	0.996	0.994	0.993	0.993	—	3	—	1.000	1.000	1.000	1.000
CTCEHC	—	2	—	1.000	1.000	1.000	1.000	—	15	—	0.980	0.972	0.959	0.961	—	3	—	0.676	0.440	0.270	0.573
LDP‑SC	—	2	12	1.000	1.000	1.000	1.000	—	15	12	0.997	0.994	0.993	0.993	—	3	8	1.000	1.000	1.000	1.000
MNNGC	—	2	8	1.000	1.000	1.000	1.000	—	15	3	0.997	0.988	0.987	0.988	—	3	8	1.000	1.000	1.000	1.000
算法	4k2_far							Circle							Happy
算法	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI
AGNES	—	4	—	1.000	1.000	1.000	1.000	—	3	—	0.649	0.297	0.164	0.589	—	3	—	0.748	0.554	0.429	0.633
DPC	2%	—	—	1.000	1.000	1.000	1.000	2%	—	—	0.452	0.138	0.049	0.466	2%	—	—	0.797	0.584	0.488	0.665
K‑means	—	4	—	0.873	0.783	0.624	0.744	—	3	—	0.482	0.157	0.053	0.403	—	3	—	0.793	0.580	0.481	0.661
SNN‑DPC	—	—	17	0.998	0.990	0.997	0.998	—	—	33	1.000	1.000	1.000	1.000	—	—	10	1.000	1.000	1.000	1.000
SC	—	4	—	1.000	1.000	1.000	1.000	—	3	—	0.632	0.332	0.199	0.599	—	3	—	0.828	0.568	0.462	0.650
CTCEHC	—	4	—	1.000	1.000	1.000	1.000	—	3	—	0.692	0.378	0.253	0.619	—	3	—	0.771	0.558	0.445	0.640
LDP‑SC	—	4	12	1.000	1.000	1.000	1.000	—	3	8	1.000	1.000	1.000	1.000	—	3	8	1.000	1.000	1.000	1.000
MNNGC	—	4	6	1.000	1.000	1.000	1.000	—	3	8	1.000	1.000	1.000	1.000	—	3	8	1.000	1.000	1.000	1.000
算法	Compound							PathBased
算法	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI
AGNES	—	6	—	0.875	0.831	0.803	0.862	—	3	—	0.730	0.518	0.444	0.653
DPC	2%	—	—	0.832	0.857	0.783	0.855	2%	—	—	0.733	0.536	0.453	0.659
K‑means	—	6	—	0.832	0.715	0.509	0.621	—	3	—	0.743	0.543	0.462	0.662
SNN‑DPC	—	—	17	0.870	0.897	0.850	0.896	—	—	12	0.833	0.591	0.575	0.720
SC	—	6	—	0.809	0.763	0.531	0.640	—	3	—	0.808	0.765	0.684	0.792
CTCEHC	—	6	—	0.729	0.643	0.448	0.572	—	3	—	0.560	0.337	0.150	0.540
LDP‑SC	—	6	12	0.614	0.728	0.546	0.649	—	3	8	0.797	0.594	0.528	0.695
MNNGC	—	6	9	1.000	1.000	1.000	1.000	—	3	5	0.893	0.758	0.729	0.825

图12 各对比算法在8个合成数据集上的平均聚类性能

Fig. 12 Average clustering performance of each comparison algorithm on 8 synthetic datasets

表5 UCI数据集上各聚类算法性能比较

Tab. 5 Performance comparison of clustering algorithms on UCI datasets

算法	Glass							Ionosphere							Knowledge
算法	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI
AGNES	—	6	—	0.383	0.067	0.02	0.493	—	2	—	0.644	0.003	0.005	0.733	—	4	—	0.616	0.303	0.267	0.513
DPC	2%	—	—	0.477	0.232	0.155	0.418	1%	—	—	0.678	0.082	0.123	0.582	2%	—	—	0.512	0.112	0.129	0.462
K-means	—	6	—	0.495	0.253	0.165	0.373	—	2	—	0.709	0.130	0.173	0.603	—	4	—	0.523	0.223	0.144	0.375
SNN-DPC	—	—	8	0.495	0.274	0.221	0.406	—	—	8	0.641	0.000	0.000	0.734	—	—	7	0.529	0.189	0.169	0.478
SC	—	6	—	0.391	0.124	0.054	0.494	—	2	—	0.597	0.084	-0.047	0.637	—	4	—	0.477	0.091	0.073	0.320
CTCEHC	—	6	—	0.608	0.321	0.215	0.420	—	2	—	0.735	0.110	0.129	0.547	—	4	—	0.541	0.159	0.120	0.371
LDP-SC	—	6	8	0.421	0.313	0.200	0.395	—	2	8	0.672	0.072	0.115	0.580	—	4	12	0.529	0.186	0.201	0.446
MNNGC	—	6	6	0.514	0.370	0.234	0.554	—	2	5	0.725	0.135	0.188	0.700	—	4	4	0.620	0.366	0.301	0.513
算法	Planning							Flags							Haberman-normal
算法	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI
AGNES	—	2	—	0.714	-0.005	-0.007	0.762	—	8	—	0.356	0.029	0.007	0.408	—	2	—	0.735	-0.061	0.003	0.774
DPC	2%	—	—	0.462	-0.004	-0.005	0.549	2%	—	—	0.340	0.037	0.052	0.313	1%	—	—	0.533	-0.003	-0.002	0.555
K-means	—	2	—	0.714	-0.004	-0.004	0.539	—	8	—	0.412	0.076	0.136	0.323	—	2	—	0.735	-0.002	-0.004	0.551
SNN-DPC	—	—	15	0.506	-0.002	-0.008	0.543	—	—	8	0.351	0.088	0.063	0.293	—	—	6	0.735	0.000	0.000	0.781
SC	—	2	—	0.722	0.012	0.014	0.553	—	8	—	0.314	0.013	0.005	0.445	—	2	—	0.714	-0.001	-0.003	0.551
CTCEHC	—	2	—	0.714	0.002	-0.036	0.669	—	8	—	0.397	0.109	0.053	0.216	—	2	—	0.735	-0.001	0.015	0.627
LDP-SC	—	2	12	0.681	-0.007	-0.009	0.716	—	8	6	0.284	0.086	0.027	0.209	—	2	10	0.513	-0.002	-0.005	0.553
MNNGC	—	2	10	0.725	0.020	0.042	0.763	—	8	6	0.417	0.083	0.131	0.334	—	2	6	0.737	-0.005	0.025	0.730
算法	Balance-scale							Bupa
算法	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI
AGNES	—	3	—	0.602	0.070	0.079	0.430	—	2	—	0.580	0.009	-0.010	0.687
DPC	2%	—	—	0.334	0.074	0.041	0.474	2%	—	—	0.452	-0.002	0.001	0.559
K-means	—	3	—	0.630	0.073	0.090	0.436	—	2	—	0.580	-0.002	-0.005	0.641
SNN-DPC	—	—	8	0.397	0.088	0.003	0.616	—	—	8	0.420	0.000	0.000	0.715
SC	—	3	—	0.606	0.085	0.103	0.443	—	2	—	0.579	-0.001	-0.002	0.712
CTCEHC	—	3	—	0.461	0.000	0.000	0.656	—	2	—	0.580	-0.001	0.006	0.671
LDP-SC	—	3	6	0.525	0.071	0.101	0.448	—	2	10	0.559	0.001	-0.007	0.678
MNNGC	—	3	12	0.693	0.160	0.179	0.507	—	2	6	0.587	0.010	0.007	0.711
算法	Breast-cancer							Ecoli
算法	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI
AGNES	—	2	—	0.706	0.007	0.010	0.762	—	8	—	0.664	0.555	0.462	0.681
DPC	2%	—	—	0.703	0.000	0.000	0.762	1%	—	—	0.557	0.605	0.649	0.767
K-means	—	2	—	0.703	0.022	0.001	0.546	—	8	—	0.756	0.543	0.373	0.518
SNN-DPC	—	—	8	0.490	-0.001	-0.003	0.537	—	—	6	0.717	0.555	0.527	0.705
SC	—	2	—	0.695	0.008	-0.027	0.725	—	8	—	0.764	0.529	0.409	0.547
CTCEHC	—	2	—	0.703	-0.003	0.010	0.654	—	8	—	0.735	0.472	0.268	0.424
LDP-SC	—	2	6	0.619	-0.003	-0.013	0.634	—	8	8	0.571	0.558	0.393	0.532
MNNGC	—	2	8	0.707	0.000	0.025	0.751	—	8	4	0.767	0.633	0.684	0.770

表5 UCI数据集上各聚类算法性能比较

Tab. 5 Performance comparison of clustering algorithms on UCI datasets

算法	Glass							Ionosphere							Knowledge
算法	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI
AGNES	—	6	—	0.383	0.067	0.02	0.493	—	2	—	0.644	0.003	0.005	0.733	—	4	—	0.616	0.303	0.267	0.513
DPC	2%	—	—	0.477	0.232	0.155	0.418	1%	—	—	0.678	0.082	0.123	0.582	2%	—	—	0.512	0.112	0.129	0.462
K-means	—	6	—	0.495	0.253	0.165	0.373	—	2	—	0.709	0.130	0.173	0.603	—	4	—	0.523	0.223	0.144	0.375
SNN-DPC	—	—	8	0.495	0.274	0.221	0.406	—	—	8	0.641	0.000	0.000	0.734	—	—	7	0.529	0.189	0.169	0.478
SC	—	6	—	0.391	0.124	0.054	0.494	—	2	—	0.597	0.084	-0.047	0.637	—	4	—	0.477	0.091	0.073	0.320
CTCEHC	—	6	—	0.608	0.321	0.215	0.420	—	2	—	0.735	0.110	0.129	0.547	—	4	—	0.541	0.159	0.120	0.371
LDP-SC	—	6	8	0.421	0.313	0.200	0.395	—	2	8	0.672	0.072	0.115	0.580	—	4	12	0.529	0.186	0.201	0.446
MNNGC	—	6	6	0.514	0.370	0.234	0.554	—	2	5	0.725	0.135	0.188	0.700	—	4	4	0.620	0.366	0.301	0.513
算法	Planning							Flags							Haberman-normal
算法	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI
AGNES	—	2	—	0.714	-0.005	-0.007	0.762	—	8	—	0.356	0.029	0.007	0.408	—	2	—	0.735	-0.061	0.003	0.774
DPC	2%	—	—	0.462	-0.004	-0.005	0.549	2%	—	—	0.340	0.037	0.052	0.313	1%	—	—	0.533	-0.003	-0.002	0.555
K-means	—	2	—	0.714	-0.004	-0.004	0.539	—	8	—	0.412	0.076	0.136	0.323	—	2	—	0.735	-0.002	-0.004	0.551
SNN-DPC	—	—	15	0.506	-0.002	-0.008	0.543	—	—	8	0.351	0.088	0.063	0.293	—	—	6	0.735	0.000	0.000	0.781
SC	—	2	—	0.722	0.012	0.014	0.553	—	8	—	0.314	0.013	0.005	0.445	—	2	—	0.714	-0.001	-0.003	0.551
CTCEHC	—	2	—	0.714	0.002	-0.036	0.669	—	8	—	0.397	0.109	0.053	0.216	—	2	—	0.735	-0.001	0.015	0.627
LDP-SC	—	2	12	0.681	-0.007	-0.009	0.716	—	8	6	0.284	0.086	0.027	0.209	—	2	10	0.513	-0.002	-0.005	0.553
MNNGC	—	2	10	0.725	0.020	0.042	0.763	—	8	6	0.417	0.083	0.131	0.334	—	2	6	0.737	-0.005	0.025	0.730
算法	Balance-scale							Bupa
算法	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI
AGNES	—	3	—	0.602	0.070	0.079	0.430	—	2	—	0.580	0.009	-0.010	0.687
DPC	2%	—	—	0.334	0.074	0.041	0.474	2%	—	—	0.452	-0.002	0.001	0.559
K-means	—	3	—	0.630	0.073	0.090	0.436	—	2	—	0.580	-0.002	-0.005	0.641
SNN-DPC	—	—	8	0.397	0.088	0.003	0.616	—	—	8	0.420	0.000	0.000	0.715
SC	—	3	—	0.606	0.085	0.103	0.443	—	2	—	0.579	-0.001	-0.002	0.712
CTCEHC	—	3	—	0.461	0.000	0.000	0.656	—	2	—	0.580	-0.001	0.006	0.671
LDP-SC	—	3	6	0.525	0.071	0.101	0.448	—	2	10	0.559	0.001	-0.007	0.678
MNNGC	—	3	12	0.693	0.160	0.179	0.507	—	2	6	0.587	0.010	0.007	0.711
算法	Breast-cancer							Ecoli
算法	d_c	c	k	Acc	AMI	ARI	FMI	d_c	c	k	Acc	AMI	ARI	FMI
AGNES	—	2	—	0.706	0.007	0.010	0.762	—	8	—	0.664	0.555	0.462	0.681
DPC	2%	—	—	0.703	0.000	0.000	0.762	1%	—	—	0.557	0.605	0.649	0.767
K-means	—	2	—	0.703	0.022	0.001	0.546	—	8	—	0.756	0.543	0.373	0.518
SNN-DPC	—	—	8	0.490	-0.001	-0.003	0.537	—	—	6	0.717	0.555	0.527	0.705
SC	—	2	—	0.695	0.008	-0.027	0.725	—	8	—	0.764	0.529	0.409	0.547
CTCEHC	—	2	—	0.703	-0.003	0.010	0.654	—	8	—	0.735	0.472	0.268	0.424
LDP-SC	—	2	6	0.619	-0.003	-0.013	0.634	—	8	8	0.571	0.558	0.393	0.532
MNNGC	—	2	8	0.707	0.000	0.025	0.751	—	8	4	0.767	0.633	0.684	0.770

图13 各对比算法在10个UCI数据集上的平均聚类性能

Fig. 13 Average clustering performance of each comparison algorithm on 10 UCI datasets

参考文献 30

1	ROSENBERGER C， CHEHDI K. Unsupervised clustering method with optimal estimation of the number of clusters： application to image segmentation［C］ // Proceedings of the 15th International Conference on Pattern Recognition. Piscataway： IEEE， 2000： 656-659. 10.1109/ICPR.2000.905473
2	LLOBELL F， VIGNEAU E， QANNARI E M. Clustering datasets by means of CLUSTATIS with identification of atypical datasets. Application to sensometrics［J］. Food Quality and Preference， 2019， 75： 97-104. 10.1016/j.foodqual.2019.02.017
3	OKTAR Y， TURKAN M. A review of sparsity-based clustering methods［J］. Signal Processing， 2018， 148： 20-30. 10.1016/j.sigpro.2018.02.010
4	AGGARWALC C， REDDY C K. Data Clustering： Algorithms and Applications［M］. Boca Raton： CRC Press， 2013：596-627.
5	POTHULA K R， SMYRNOVA D， SCHRÖDER G F. Clustering cryo-EM images of helical protein polymers for helical reconstructions［J］. Ultramicroscopy， 2018， 203： 132-138. 10.1016/j.ultramic.2018.12.009
6	HAN J， KAMBER M. 数据挖掘：概念与技术［M］.范明，孟小峰，译.2版.北京：机械工业出版社，2007： 263-266.
	HAN J， KAMBER M. Data Mining： Concepts and Techniques ［M］. FAN M， MENG X F， translated. 2nd ed. Beijing： China Machine Press， 2007： 263-266.
7	JAIN A K. Data clustering： 50 years beyond K-means［J］. Pattern Recognition Letters，2010， 31（8）：651-666. 10.1016/j.patrec.2009.09.011
8	SAXENA A， PRASAD M， GUPTA A， et al. A review of clustering techniques and developments［J］. Neurocomputing， 2017， 267： 664-681. 10.1016/j.neucom.2017.06.053
9	王巧玲，乔非，蒋友好.基于聚合距离参数的改进K-means算法［J］.计算机应用，2019，39（9）：2586-2590. 10.11772/j.issn.1001-9081.2019030485
	WANG Q L， QIAO F， JIANG Y H. Improved K-means algorithm with aggregation distance coefficient［J］. Journal of Computer Applications， 2019， 39（9）： 2586-2590. 10.11772/j.issn.1001-9081.2019030485
10	张朋，李小林，王李妍.基于DBSCAN的动态邻域密度聚类算法［J］.计算机科学，2023，50（6A）：220400127. 10.11896/jsjkx.220400127
	ZHANG P， LI X L， WANG L Y. Dynamic neighborhood density clustering algorithm based on DBSCAN ［J］. Computer Science， 2023，50（6A）：220400127. 10.11896/jsjkx.220400127
11	刘静姝，王莉，刘惊雷.无需特征分解的快速谱聚类算法［J］.计算机应用，2020，40（12）：3413-3422. 10.11772/j.issn.1001-9081.2020061040
	LIU J S， WANG L， LIU J L. Fast spectral clustering algorithm without eigen-decomposition［J］. Journal of Computer Applications， 2020，40（12）：3413-3422. 10.11772/j.issn.1001-9081.2020061040
12	KANUNGO T， MOUNT D M， NETANYAHU N S， et al. An efficient k-means clustering algorithm： analysis and implementation ［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2002，24（7）：881-892. 10.1109/tpami.2002.1017616
13	杨静，高嘉伟，梁吉业，等. 基于数据场的改进DBSCAN聚类算法［J］. 计算机科学与探索， 2012， 6 （10）： 903-911. 10.3778/j.issn.1673-9418.2012.10.005
	YANG J， GAO J W， LIANG J Y， et al. An improved DBSCAN clustering algorithm based on data field［J］. Journal of Frontiers of Computer Science and Technology， 2012， 6（10）： 903-911. 10.3778/j.issn.1673-9418.2012.10.005
14	RODRIGUEZ A， LAIO A. Clustering by fast search and find of density peaks［J］. Science， 2014， 344（6191）： 1492-1496. 10.1126/science.1242072
15	LIU R， WANG H， YU X. Shared-nearest-neighbor-based clustering by fast search and find of density peaks［J］. Information Sciences， 2018， 450： 200-226. 10.1016/j.ins.2018.03.031
16	LIN C-R， CHEN M-S. Combining partitional and hierarchical algorithms for robust and efficient data clustering with cohesion self-merging［J］．IEEE Transactions on Knowledge and Data Engineering， 2005，17（2）： 145-159. 10.1109/tkde.2005.21
17	NG A Y， JORDAN M I， WEISS Y. On spectral clustering： analysis and an algorithm［C］// Proceedings of the 14th International Conference on Neural Information Processing Systems. Cambridge： MIT Press， 2002：849-856.
18	KARYPIS G， HAN E H， KUMAR V. Chameleon： hierarchical clustering using dynamic modeling［J］. Computer，1999， 32（8）：68-75. 10.1109/2.781637
19	ZHONG C， MIAO D， FRÄNTI P. Minimum spanning tree based split-and-merge： a hierarchical clustering method［J］. Information Sciences， 2011， 181（16）： 3397-3410. 10.1016/j.ins.2011.04.013
20	MA Y， LIN H， WANG Y， et al. A multi-stage hierarchical clustering algorithm based on centroid of tree and cut edge constraint［J］. Information Sciences， 2021， 557： 194-219. 10.1016/j.ins.2020.12.016
21	LONG Z， GAO Y， MENG H， et al. Clustering based on local density peaks and graph cut［J］. Information Sciences， 2022， 600： 263-286. 10.1016/j.ins.2022.03.091
22	龙建武，王强.反向近邻构造连通图的聚类算法［J/OL］.计算机科学与探索， 2022 ［2023-06-01］. .
	LONG J W， WANG Q. Clustering algorithm of connected graphs constructed by reverse nearest neighbor［J/OL］. Journal of Frontiers of Computer Science and Technology， 2022 ［2023-06-01］. .
23	HUANG D， WANG C-D， PENG H， et al. Enhanced ensemble clustering via fast propagation of cluster-wise similarities［J］. IEEE Transactions on Systems， Man， and Cybernetics： Systems， 2021， 51（1）：508-520. 10.1109/tsmc.2018.2876202
24	白璐，赵鑫，孔钰婷，等.谱聚类算法研究综述［J］.计算机工程与应用，2021，57（14）： 15-26. 10.3778/j.issn.1002-8331.2103-0547
	BAI L， ZHAO X， KONG Y T， et al. Review of spectral clustering algorithms ［J］. Computer Engineering and Applications， 2021， 57（14）： 15-26. 10.3778/j.issn.1002-8331.2103-0547
25	王淞，黄浩，余果，等.一种基于 k 近邻图的稀有类检测算法［J］.软件学报，2016，27（9）：2320-2331. 10.13328/j.cnki.jos.004872
	WANG S， HUANG H， YU G， et al. Rare category detection algorithm based on k‑nearest neighbor graphs ［J］. Journal of Software，2016，27（9）：2320-2331. 10.13328/j.cnki.jos.004872
26	冯骥，冉瑞生，魏延.基于自然邻居邻域图的无参数离群检测算法［J］.智能系统学报， 2019， 14（5）： 998-1006. 10.11992/tis.201809032
	FENG J， RAN R S， WEI Y. A parameter-free outlier detection algorithm based on natural neighborhood graph ［J］. CAAI Transactions on Intelligent Systems， 2019， 14（5）： 998-1006. 10.11992/tis.201809032
27	周志华.机器学习［M］.北京：清华大学出版社，2016：319-327.
	ZHOU Z H. Machine Learning ［M］. Beijing： Tsinghua University Press， 2016： 319-327.
28	CHANG H， D-Y YEUNG. Robust path-based spectral clustering［J］. Pattern Recognition， 2008， 41 （1）： 191-203. 10.1016/j.patcog.2007.04.010
29	VEENMAN C J， REINDERS M J T， BACKER E. A maximum variance cluster algorithm ［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2002， 24（9）： 1273-1280. 10.1109/tpami.2002.1033218
30	ZHANG R， DU T， QU S N， et al. Adaptive density-based clustering algorithm with shared KNN conflict game ［J］. Information Sciences， 2021， 565： 344-369. 10.1016/j.ins.2021.02.017

[1]	张忠平, 郭鑫, 张玉停, 张睿博. 基于全息图平稳分布因子的离群点检测算法[J]. 《计算机应用》唯一官方网站, 2023, 43(6): 1705-1712.
[2]	郭景峰, 董慧, 张庭玮, 陈晓. 主题关注网络的表示学习[J]. 《计算机应用》唯一官方网站, 2020, 40(2): 441-447.
[3]	宋艳, 殷俊. 基于共享近邻的多视角谱聚类算法[J]. 计算机应用, 2020, 40(11): 3211-3216.
[4]	鲍舒婷, 孙丽萍, 郑孝遥, 郭良敏. 基于共享近邻相似度的密度峰聚类算法[J]. 计算机应用, 2018, 38(6): 1601-1607.
[5]	赵艳丽王兴. 基于双边转移概率矩阵的JPEG图像隐写检测[J]. 计算机应用, 2013, 33(04): 1074-1076.
[6]	薛文娟刘培玉刘栋. 引入共享近邻加权图的Chameleon算法[J]. 计算机应用, 2012, 32(10): 2884-2887.
[7]	李凯黄添强余养强郭躬德. 非线性降维算法及其在医院绩效考核上的应用[J]. 计算机应用, 2010, 30(4): 1004-1007.
[8]	许宁张毅坤. 基于正交分层聚类算法软件可靠性模型的预测分析[J]. 计算机应用, 2007, 27(3): 635-637.
[9]	刘敏娟柴玉梅 . 基于网格的共享近邻聚类算法[J]. 计算机应用, 2006, 26(7): 1673-1675.

融合转移概率矩阵的多阶最近邻图聚类算法

Multi-order nearest neighbor graph clustering algorithm by fusing transition probability matrix

RichHTML

PDF

可视化

摘要/Abstract

引用本文

使用本文

图/表 18

参考文献 30

相关文章 9

编辑推荐

Metrics