Journal of Computer Applications, 2025, Vol. 45, Issue (6): 1712-1723. DOI: 10.11772/j.issn.1001-9081.2024070943

• The 12th CCF Conference on Big Data

Subspace Gaussian mixture model clustering ensemble algorithm based on maximum mean discrepancy

Yulin HE1,2, Xu LI1, Yingting HE2, Laizhong CUI1,2, Zhexue HUANG1,2

  1. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Guangdong 518107, China
  2. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060, China
  • Received: 2024-07-08 Revised: 2024-08-02 Accepted: 2024-08-22 Online: 2024-09-02 Published: 2025-06-10
  • Contact: Xu LI
  • About the authors: HE Yulin, born in 1982 in Hengshui, Hebei, Ph. D., research fellow, CCF member. His research interests include approximate computation of big data, multi-sample statistics, data mining, and machine learning.
    LI Xu, born in 1996 in Shantou, Guangdong, M. S., engineer, CCF member. His research interests include distributed computation of big data, data mining, and machine learning. lixu@gml.ac.cn
    HE Yingting, born in 1997 in Shenzhen, Guangdong, M. S., engineer. Her research interests include data mining and machine learning.
    CUI Laizhong, born in 1984 in Baishan, Jilin, Ph. D., professor, CCF member. His research interests include Internet architecture, edge computing, and AI-driven network optimization.
    HUANG Zhexue, born in 1959 in Harbin, Heilongjiang, Ph. D., professor, CCF member. His research interests include intelligent computing for new computing power networks, approximate computation of big data, data mining, and machine learning.
  • Supported by:
    General Program of Natural Science Foundation of Guangdong Province (2023A1515011667); Science and Technology Major Project of Shenzhen (KJZD20230923114809020); Key Project of Guangdong-Shenzhen Joint Fund of Guangdong Basic and Applied Basic Research Foundation (2023B1515120020); Key Basic Research Foundation of Shenzhen (JCYJ20220818100205012)

Abstract:

To address the limited performance and parameter sensitivity of Gaussian Mixture Model (GMM) clustering algorithms on large-scale high-dimensional data, a Subspace GMM Clustering Ensemble (SGMM-CE) algorithm based on Maximum Mean Discrepancy (MMD) was proposed. Firstly, Random Sample Partition (RSP) was performed on the original large-scale high-dimensional dataset to obtain multiple data subsets, thereby reducing the scale of the clustering problem in terms of sample size. Secondly, according to the influence of features on the optimal number of GMM components, subspace learning was performed in the high-dimensional feature space of each data subset to obtain multiple low-dimensional feature subspaces, and GMM clustering was conducted on each subspace to obtain a series of heterogeneous GMMs. Thirdly, the clustering results from different feature subspaces of the same data subset were relabeled and merged using the proposed Average Shared Affiliation Probability (ASAP). Finally, the extended Subspace MMD (SubMMD) was used as the criterion for measuring the distributional consistency between two clusters in the clustering results of different data subsets, and the clustering results of these data subsets were relabeled and merged accordingly to obtain the final clustering ensemble result for the original dataset. Extensive experiments were conducted to validate the effectiveness of the SGMM-CE algorithm. Experimental results show that, compared with Meta-CLustering Algorithm (MCLA), the best-performing comparison algorithm, the SGMM-CE algorithm improves the average Normalized Mutual Information (NMI), Clustering Accuracy (CA) and Adjusted Rand Index (ARI) values on the selected datasets by 19%, 20% and 52%, respectively. In addition, the feasibility and rationality experiments confirm the parameter convergence and time efficiency of the SGMM-CE algorithm, indicating that it can handle large-scale high-dimensional data clustering problems efficiently.

Key words: unsupervised learning, ensemble learning, subspace learning, Maximum Mean Discrepancy (MMD), Gaussian Mixture Model (GMM)
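
The abstract names two building blocks that are easy to illustrate in isolation: random sample partition and MMD as a measure of distributional consistency between two clusters. The sketch below is only an illustration under stated assumptions, not the authors' implementation. It uses the standard biased empirical estimate of squared MMD with a Gaussian (RBF) kernel, not the paper's SubMMD or ASAP, and the function names `random_sample_partition`, `rbf_kernel`, `mmd2` and the bandwidth parameter `gamma` are hypothetical.

```python
import math
import random


def random_sample_partition(n_samples, n_blocks, seed=0):
    """Shuffle sample indices and deal them into disjoint, near-equal blocks.

    Assumption: a plain shuffle-and-split stands in for the RSP step.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[b::n_blocks] for b in range(n_blocks)]


def rbf_kernel(x, y, gamma):
    """Gaussian kernel exp(-gamma * ||x - y||^2) for two equal-length points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd2(cluster_a, cluster_b, gamma=1.0):
    """Biased empirical estimate of squared MMD between two point sets.

    Small values suggest the two clusters follow similar distributions.
    """
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(cluster_a, cluster_a)
            + mean_kernel(cluster_b, cluster_b)
            - 2.0 * mean_kernel(cluster_a, cluster_b))


# Synthetic 2-D clusters: two drawn around the origin, one drawn far away.
rng = random.Random(42)
origin_1 = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
origin_2 = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
far_away = [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(60)]

mmd_same = mmd2(origin_1, origin_2)   # clusters from the same distribution
mmd_diff = mmd2(origin_1, far_away)   # clusters from different distributions
blocks = random_sample_partition(10, 3)
```

In an ensemble setting, a pair of clusters from different data subsets with a small `mmd2` value would be candidates for receiving the same label, while a large value would keep them apart. The pairwise kernel sums cost O(nm) per cluster pair, so a practical version would likely subsample or vectorize.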

CLC Number: