Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (7): 2153-2161. DOI: 10.11772/j.issn.1001-9081.2024070942

• The 39th CCF National Conference of Computer Applications (CCF NCCA 2024) •

Semi-EM algorithm for solving the Gamma mixture model of multimodal probability distributions

CHEN Jiaqi1,2, HE Yulin2, CHENG Yingchao2, HUANG Zhexue1,2

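  1. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060
  2. Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen), Shenzhen, Guangdong 518107
  • Received: 2024-07-08 Revised: 2024-09-05 Accepted: 2024-10-09 Online: 2025-07-10 Published: 2025-07-10
  • Corresponding author: HE Yulin
  • About the authors: CHEN Jiaqi (born 1999), female, from Puning, Guangdong, Ph. D. candidate. Main research interests: multi-sample statistical analysis, data mining, and machine learning.
    HE Yulin (born 1982), male, from Hengshui, Hebei, research fellow, Ph. D., CCF member. Main research interests: big data system computing, multi-sample statistical analysis, data mining, and machine learning. yulinhe@gml.ac.cn
    CHENG Yingchao (born 1989), male, from Handan, Hebei, associate research fellow, Ph. D., CCF member. Main research interests: artificial intelligence, intelligent computing for big data, data mining, and machine learning.
    HUANG Zhexue (born 1959), male, from Harbin, Heilongjiang, professor, Ph. D., CCF member. Main research interests: intelligent computing for new computing power networks, big data approximate computing, data mining, and machine learning.
  • Funding:
    Guangdong-Shenzhen Joint Fund of the Guangdong Basic and Applied Basic Research Foundation (2023B1515120020); Natural Science Foundation of Guangdong Province (2023A1515011667); Shenzhen Science and Technology Major Project (202302D074); Shenzhen Basic Research General Program (JCYJ20210324093609026)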

Semi-EM algorithm for solving Gamma mixture model of multimodal probability distribution

Jiaqi CHEN1,2, Yulin HE2, Yingchao CHENG2, Zhexue HUANG1,2

  1. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen Guangdong 518060, China
  2. Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen Guangdong 518107, China
  • Received: 2024-07-08 Revised: 2024-09-05 Accepted: 2024-10-09 Online: 2025-07-10 Published: 2025-07-10
  • Contact: Yulin HE
  • About author: CHEN Jiaqi, born in 1999, Ph. D. candidate. Her research interests include multi-sample statistical analysis, data mining, and machine learning.
    HE Yulin, born in 1982, Ph. D., research fellow. His research interests include big data system computing, multi-sample statistical analysis, data mining, and machine learning.
    CHENG Yingchao, born in 1989, Ph. D., associate research fellow. His research interests include artificial intelligence, intelligent computing for big data, data mining, and machine learning.
    HUANG Zhexue, born in 1959, Ph. D., professor. His research interests include intelligent computing for new computing power networks, big data approximate computing, data mining, and machine learning.
  • Supported by:
    Guangdong Shenzhen Joint Fund of Guangdong Basic and Applied Basic Research Foundation (2023B1515120020); Natural Science Foundation of Guangdong Province (2023A1515011667); Science and Technology Major Project of Shenzhen (202302D074); Basic Research Foundation of Shenzhen (JCYJ20210324093609026)

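Abstract:

The Expectation-Maximization (EM) algorithm plays an important role in parameter estimation for mixture models. However, existing EM algorithms have limitations when solving for the parameters of the Gamma Mixture Model (GaMM), chiefly low-quality parameter estimates caused by approximate calculations and low computational efficiency caused by heavy numerical computation. To overcome these limitations and make full use of the multimodal nature of data, a Semi-EM algorithm is proposed to solve the GaMM used for estimating multimodal probability distributions. First, clustering is used to probe the spatial distribution of the data and initialize the GaMM parameters, so that the multimodality of the data is characterized more accurately. Second, within the EM framework, the shape parameters of the GaMM, which are hard to update because they lack closed-form update expressions, are updated with a customized heuristic strategy that adjusts them step by step in the direction of maximizing the log-likelihood, while the other parameters are updated in closed form. A series of experiments verifies the feasibility, rationality, and effectiveness of the Semi-EM algorithm. The experimental results show that the Semi-EM algorithm is better than the four compared algorithms at accurately estimating multimodal probability distributions, with lower error metrics and higher log-likelihood values, which indicates that the algorithm provides more accurate model parameter estimates and therefore characterizes the multimodal nature of the data more precisely.

Keywords: multimodal probability density function, Gamma mixture model, expectation-maximization algorithm, clustering, log-likelihood function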

Abstract:

The Expectation-Maximization (EM) algorithm plays an important role in parameter estimation for mixture models. However, existing EM algorithms for estimating the parameters of the Gamma Mixture Model (GaMM) have two main limitations: low-quality parameter estimates caused by approximate calculations, and low computational efficiency caused by the large amount of numerical computation. To address these limitations and fully exploit the multimodal nature of data, a Semi-EM algorithm was proposed to solve the GaMM for estimating multimodal probability distributions. Firstly, the spatial distribution characteristics of the data were explored by clustering in order to initialize the GaMM parameters, so that the multimodality of the data was characterized more precisely. Secondly, within the framework of the EM algorithm, a customized heuristic strategy was employed for the shape parameters of the GaMM, which are difficult to update because they lack closed-form update expressions: the shape parameters were adjusted gradually in the direction of maximizing the log-likelihood value, while the remaining parameters were updated in closed form. A series of experiments was conducted to validate the feasibility, rationality, and effectiveness of the proposed Semi-EM algorithm. Experimental results demonstrate that the Semi-EM algorithm outperforms the four comparison algorithms in accurately estimating multimodal probability distributions. Specifically, the Semi-EM algorithm has lower error metrics and higher log-likelihood values, indicating that it provides more accurate model parameter estimates and therefore a more precise representation of the multimodal nature of the data.

Key words: multimodal Probability Density Function (PDF), Gamma Mixture Model (GaMM), Expectation-Maximization (EM) algorithm, clustering, log-likelihood function
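
The two-stage procedure outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering method or the heuristic, so this sketch assumes a one-dimensional k-means with method-of-moments initialization and a simple multiplicative trial step for each shape parameter that is accepted only when the log-likelihood increases. All function names (`init_by_clustering`, `semi_em`, `demo_data`) and all default values are hypothetical. The mixing weights and the scale parameters are updated in closed form; for a fixed shape, the scale that maximizes the expected complete-data log-likelihood is the responsibility-weighted mean divided by the shape.

```python
import math
import random


def log_gamma_pdf(x, shape, scale):
    """Log-density of Gamma(shape, scale) at x > 0."""
    return ((shape - 1.0) * math.log(x) - x / scale
            - shape * math.log(scale) - math.lgamma(shape))


def point_terms(x, weights, shapes, scales):
    """log(weight_j * density_j(x)) for every component j."""
    return [math.log(w) + log_gamma_pdf(x, a, s)
            for w, a, s in zip(weights, shapes, scales)]


def log_likelihood(data, weights, shapes, scales):
    """Observed-data log-likelihood, using log-sum-exp per point."""
    total = 0.0
    for x in data:
        terms = point_terms(x, weights, shapes, scales)
        m = max(terms)
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total


def e_step(data, weights, shapes, scales):
    """Posterior probability of each component for each point."""
    resp = []
    for x in data:
        terms = point_terms(x, weights, shapes, scales)
        m = max(terms)
        probs = [math.exp(t - m) for t in terms]
        z = sum(probs)
        resp.append([p / z for p in probs])
    return resp


def init_by_clustering(data, k, iters=20):
    """Assumed initialization: 1-D k-means, then method of moments."""
    srt = sorted(data)
    centers = [srt[int((j + 0.5) * len(srt) / k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in data:
            groups[min(range(k), key=lambda c: abs(x - centers[c]))].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    counts, shapes, scales = [], [], []
    for g in groups:
        g = g or list(data)  # guard against an empty cluster
        mean = sum(g) / len(g)
        var = max(sum((x - mean) ** 2 for x in g) / len(g), 1e-6)
        counts.append(len(g))
        shapes.append(mean * mean / var)  # Gamma mean = shape * scale
        scales.append(var / mean)         # Gamma variance = shape * scale^2
    return [c / sum(counts) for c in counts], shapes, scales


def semi_em(data, k, iters=40, step=0.2):
    """Closed-form weight/scale updates plus a heuristic shape search."""
    weights, shapes, scales = init_by_clustering(data, k)
    steps = [step] * k
    history = [log_likelihood(data, weights, shapes, scales)]
    for _ in range(iters):
        resp = e_step(data, weights, shapes, scales)
        nk = [sum(r[j] for r in resp) for j in range(k)]
        sx = [sum(r[j] * x for r, x in zip(resp, data)) for j in range(k)]
        weights = [nj / len(data) for nj in nk]                   # closed form
        scales = [sx[j] / (shapes[j] * nk[j]) for j in range(k)]  # closed form
        best = log_likelihood(data, weights, shapes, scales)
        for j in range(k):
            accepted = False
            for factor in (1.0 + steps[j], 1.0 / (1.0 + steps[j])):
                trial_a, trial_s = list(shapes), list(scales)
                trial_a[j] = shapes[j] * factor
                trial_s[j] = sx[j] / (trial_a[j] * nk[j])
                ll = log_likelihood(data, weights, trial_a, trial_s)
                if ll > best:  # keep the move only if the likelihood rises
                    best, shapes, scales, accepted = ll, trial_a, trial_s, True
                    break
            if not accepted:
                steps[j] *= 0.5  # shrink the trial step for this component
        history.append(best)
    return weights, shapes, scales, history


def demo_data(n=300, seed=0):
    """Hypothetical bimodal sample: Gamma(2, 1) and Gamma(30, 0.5)."""
    rng = random.Random(seed)
    return ([rng.gammavariate(2.0, 1.0) for _ in range(n)]
            + [rng.gammavariate(30.0, 0.5) for _ in range(n)])
```

Calling `semi_em(demo_data(), k=2)` returns the fitted weights, shapes, scales, and the log-likelihood after each iteration. Because every shape move is accepted only when it raises the log-likelihood, and the closed-form updates are ordinary EM steps, the recorded log-likelihood should not decrease from one iteration to the next.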

CLC number: