《计算机应用》唯一官方网站 ›› 2023, Vol. 43 ›› Issue (12): 3764-3771.DOI: 10.11772/j.issn.1001-9081.2022121844

• 数据科学与技术 • 上一篇    下一篇

基于马尔可夫优化的高效用项集挖掘算法

钟新成1,2(), 刘昶2, 赵秀梅1   

  1. 1.长治学院 计算机系,山西 长治 046000
    2.中国科学院 沈阳自动化研究所,沈阳 110169
  • 收稿日期:2022-12-09 修回日期:2023-02-28 接受日期:2023-03-03 发布日期:2023-03-13 出版日期:2023-12-10
  • 通讯作者: 钟新成
  • 作者简介:钟新成(1987—),男,湖南长沙人,讲师,硕士,主要研究方向:数据挖掘、强化学习;Email:cy12016596@czc.edu.cn
    刘昶(1973—),女,辽宁沈阳人,副研究员,博士,主要研究方向:数据挖掘、调度优化
    赵秀梅(1970—),女,山西高平人,讲师,硕士,主要研究方向:软件测试。
  • 基金资助:
    2022年度山西省高等学校科技创新项目(2022L517)

High utility itemset mining algorithm based on Markov optimization

Xincheng ZHONG1,2(), Chang LIU2, Xiumei ZHAO1   

  1. 1.Department of Computer Science,Changzhi University,Changzhi Shanxi 046000,China
    2.Shenyang Institute of Automation,Chinese Academy of Sciences,Shenyang Liaoning 110169,China
  • Received:2022-12-09 Revised:2023-02-28 Accepted:2023-03-03 Online:2023-03-13 Published:2023-12-10
  • Contact: Xincheng ZHONG
  • About author:LIU Chang, born in 1973, Ph. D., associate research fellow. Her research interests include data mining, schedule optimization.
    ZHAO Xiumei, born in 1970, M. S., lecturer. Her research interests include software testing.
  • Supported by:
    Science and Technology Innovation Project of Colleges and Universities in Shanxi Province in 2022(2022L517)

摘要:

基于树型和链表结构的高效用项集挖掘(HUIM)算法通常需要指数量级的搜索空间,而基于进化类型的挖掘算法未能充分考虑变量间的相互作用,因此提出一种基于马尔可夫优化的HUIM算法(HUIM-MOA)。首先,采用位图矩阵表示数据库和使用期望向量编码,以实现对数据库的快速扫描和效用值的高效计算;其次,通过计算优势个体间的互信息估计马尔可夫网络(MN)结构,并根据它们的局部特性使用吉布斯采样以产生新的种群;最后,为防止算法过快陷入局部最优和减少高效用项集的缺失,分别采用种群多样性保持策略和精英策略。在真实数据集上的实验结果表明,相较于次优的基于粒子群优化(PSO)的生物启发式HUI框架(Bio-HUIF-PSO)算法,在给定较大最小阈值的情况下,HUIM-MOA可以找到全部的高效用项集(HUI),收敛速度平均提升12.5%,挖掘HUI数平均提高2.85个百分点,运行时间平均减少14.6%。HUIM-MOA较进化型HUIM算法有更强的搜索性能,能有效减少搜索时间和提高搜索质量。

关键词: 高效用项集挖掘, 马尔可夫网络, 位图矩阵, 吉布斯采样, 精英策略

Abstract:

To address the problems that the High Utility Itemset Mining (HUIM) algorithms based on tree and link table structures often consume search spaces of orders of magnitude, and the evolutionary type-based mining algorithms fail to fully consider the interactions between variables, an HUIM algorithm based on Markov Optimization (HUIM-MOA) was proposed. Firstly, a bitmap matrix for expressing database and expectation vector encoding were used to achieve fast scanning of the database and efficient computation of utility values, respectively. Then, the Markov Network (MN) structure was estimated by computing the mutual information among dominant individuals and new populations were generated by using Gibbs sampling according to their local characteristics. Finally, population diversity preservation strategy and elite strategy were used to prevent the algorithm from falling into local optimum too quickly and to reduce the missing of high utility itemsets, respectively. Experimental results on real datasets show that compared with Bio-inspired High Utility Itemset Framework based on Particle Swarm Optimization (Bio-HUIF-PSO) algorithm, HUIM-MOA can find all the High Utility Itemsets (HUIs) when given a larger minimum threshold, with on average 12.5% improvement in convergence speed, 2.85 percentage point improvement in mined HUI number, and 14.6% reduction in running time. It can be seen that HUIM-MOA has stronger search performance than the evolutionary HUIM algorithm, which can effectively reduce the search time and improve the search quality.

Key words: High Utility Itemset Mining (HUIM), Markov network, bitmap matrix, Gibbs sampling, elite strategy

中图分类号: