《计算机应用》唯一官方网站 ›› 2024, Vol. 44 ›› Issue (5): 1527-1538.DOI: 10.11772/j.issn.1001-9081.2023050727

• 数据科学与技术 • 上一篇    

融合转移概率矩阵的多阶最近邻图聚类算法

徐童童1, 解滨1,2,3(), 张春昊1, 张喜梅1   

  1. 1.河北师范大学 计算机与网络空间安全学院, 石家庄 050024
    2.河北省网络与信息安全重点实验室(河北师范大学), 石家庄 050024
    3.供应链大数据分析与数据安全河北省工程研究中心(河北师范大学), 石家庄 050024
  • 收稿日期:2023-06-06 修回日期:2023-07-09 接受日期:2023-07-14 发布日期:2023-08-01 出版日期:2024-05-10
  • 通讯作者: 解滨
  • 作者简介:徐童童(1998—),女,河北衡水人,硕士研究生,CCF会员,主要研究方向:机器学习、网络安全
    张春昊(1996—),男,河北沧州人,硕士研究生,CCF会员,主要研究方向:机器学习
    张喜梅(1998—),女,湖南益阳人,硕士研究生,CCF会员,主要研究方向:机器学习、网络安全。
    第一联系人:解滨(1976—),男,吉林通化人,教授,博士,CCF会员,主要研究方向:粒计算、机器学习、近似推理
  • 基金资助:
    国家自然科学基金资助项目(62076088);河北师范大学研究生创新项目(XCXZZSS202317)

Multi-order nearest neighbor graph clustering algorithm by fusing transition probability matrix

Tongtong XU1, Bin XIE1,2,3(), Chunhao ZHANG1, Ximei ZHANG1   

  1. 1.College of Computer and Cyber Security,Hebei Normal University,Shijiazhuang Hebei 050024,China
    2.Hebei Provincial Key Laboratory of Network and Information Security (Hebei Normal University),Shijiazhuang Hebei 050024,China
    3.Hebei Provincial Engineering Research Center for Supply Chain Big Data Analytics and Data Security (Hebei Normal University),Shijiazhuang Hebei 050024,China
  • Received:2023-06-06 Revised:2023-07-09 Accepted:2023-07-14 Online:2023-08-01 Published:2024-05-10
  • Contact: Bin XIE
  • About author:XU Tongtong, born in 1998, M. S. candidate. Her research interests include machine learning, cyber security.
    ZHANG Chunhao, born in 1996, M. S. candidate. His research interests include machine learning.
    ZHANG Ximei, born in 1998, M. S. candidate. Her research interests include machine learning, cyber security.
  • Supported by:
    National Natural Science Foundation of China(62076088);Graduate Innovation Project of Hebei Normal University(XCXZZSS202317)

摘要:

聚类是根据样本之间的相似性将数据集划分为多个类簇。现有的大多数聚类方法都存在两个挑战:一方面,在定义样本间相似性时往往没有考虑样本的空间分布结构,无法构建稳定的相似度矩阵;另一方面,图聚类构造的样本图结构过于复杂,计算成本较高。为解决这两个问题,提出融合转移概率矩阵的多阶最近邻图聚类算法(MNNGC)。首先,综合样本的近邻关系和空间分布结构,将共享近邻定义的相似度进行趋密性加权,得到节点间的趋密性亲和矩阵;其次,利用节点间多阶概率转移预测非邻接点的关联程度,并通过融合多阶转移概率矩阵得到稳定的节点间亲和矩阵;再次,为进一步增强图局部结构,重新构建节点的多阶最近邻图,并对多阶最近邻图的局部结构分层聚类;最后,优化了边缘点分配策略。定位实验结果表明,MNNGC在合成数据集上的准确率(Acc)均优于对比算法,且在8个UCI数据集上的Acc为最大值。其中在Compound数据集上,MNNGC的Acc、调整互信息(AMI)、调整兰德指数(ARI)和FM指数(FMI)相较于基于局部密度峰值的谱聚类(LDP-SC)算法分别提高38.6、27.2、45.4、35.1个百分点。

关键词: 共享近邻, 趋密性, 转移概率, 多阶最近邻, 分层聚类

Abstract:

Clustering is to divide a dataset into multiple clusters based on the similarity between samples. Most existing clustering methods face two challenges. On the one hand, when defining the similarity between samples, the spatial distribution structure of the samples is often not considered, making it difficult to construct a stable similarity matrix. On the other hand, the sample graph structure constructed by graph clustering is too complex and has high computational costs. To solve these two problems, a Multi-order Nearest Neighbor Graph Clustering algorithm by fusing transition probability matrix (MNNGC) was proposed. Firstly, the nearest neighbor relationship and spatial distribution structure of samples were comprehensively considered, the similarity defined by shared nearest neighbor was weighted for densification, and the densification affinity matrix between nodes was obtained. Secondly, by utilizing multi-order probability transition between nodes, the correlation degrees of non-adjacent nodes were predicted, and a stable inter-node affinity matrix was obtained by fusing the multi-order transition probability matrix. Then, to further enhance the local structure of the graph, the multi-order nearest neighbor graph of nodes was reconstructed, and hierarchically clustered. Finally, the edge node allocation strategy was optimized. Positioning experimental results show that MNNGC achieves the highest Accuracy (Acc) among comparison clustering algorithms on all the synthetic datasets and 8 UCI datasets. The Acc, Adjusted Mutual Information (AMI), Adjusted Rand Index (ARI) and Fowlkes and Mallows Index (FMI) of MNNGC algorithm are improved by 38.6, 27.2, 45.4 and 35.1 percentage points, respectively, compared with Local Density Peaks-based Spectral Clustering (LDP-SC) algorithm.

Key words: shared neighbor, densification, transition probability, multi-order nearest neighbor, hierarchical clustering

中图分类号: