Journal of Computer Applications (official website) ›› 2025, Vol. 45 ›› Issue (8): 2572-2581. DOI: 10.11772/j.issn.1001-9081.2024071063

• Data Science and Technology •

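Non-redundant statistically significant discriminative high utility pattern mining algorithm

Jun WU, Aijia OUYANG, Ya WANG

  1. School of Information Engineering, Zunyi Normal University, Zunyi, Guizhou 563006
  • Received: 2024-07-30 Revised: 2024-10-14 Accepted: 2024-10-15 Online: 2024-11-19 Published: 2025-08-10
  • Corresponding author: Jun WU
  • About the authors: OUYANG Aijia (born 1975), male, from Loudi, Hunan, professor, Ph.D.; main research interests: intelligent computing and parallel computing.
    WANG Ya (born 1974), male, from Zunyi, Guizhou, associate professor, Ph.D.; main research interests: data mining and machine learning.
  • Funding:
    National Natural Science Foundation of China (62066049); Youth Project of Guizhou Provincial Higher Education Institutions (QJJ2022313); Guizhou Provincial Science and Technology Support Program (QKHZC2023257); Zunyi Science and Technology Cooperation Project (ZSKHHZ2022123)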

Non-redundant and statistically significant discriminative high utility pattern mining algorithm

Jun WU, Aijia OUYANG, Ya WANG

  1. School of Information Engineering, Zunyi Normal University, Zunyi, Guizhou 563006, China
  • Received: 2024-07-30 Revised: 2024-10-14 Accepted: 2024-10-15 Online: 2024-11-19 Published: 2025-08-10
  • Contact: Jun WU
  • About author: OUYANG Aijia, born in 1975, Ph.D., professor. His research interests include intelligent computing and parallel computing.
    WANG Ya, born in 1974, Ph.D., associate professor. His research interests include data mining and machine learning.
  • Supported by:
    National Natural Science Foundation of China (62066049)

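Abstract (translated from the Chinese original):

To address the problem of identifying false positive patterns and redundant patterns in high utility pattern mining, UTDHU (Unlimited Testing for Discriminative High Utility pattern mining), a discriminative high utility pattern mining algorithm based on unlimited testing and independent growth rate, is proposed. First, the discriminative high utility patterns that satisfy the utility threshold and the difference threshold are found in the target transaction set. Second, a prefix-item shared tree is built to compute the independent growth rate of each pattern quickly, and redundant discriminative high utility patterns whose independent growth rate does not exceed the independence threshold are screened out. Finally, unlimited testing is used to compute the p-value, a measure of statistical significance, of each remaining pattern, and false positive discriminative high utility patterns are filtered from the overall result according to the family-wise error rate. Experimental results on four benchmark transaction sets and two synthetic transaction sets show that, compared with algorithms such as Hamm and YBHU (Yekutieli-Benjamini resampling for High Utility pattern mining), the proposed algorithm outputs the fewest patterns, filtering out at least 97.8% of the tested patterns. In terms of pattern quality, its proportion of false positive discriminative high utility patterns is below 5.2%, and the classification accuracy of its constructed features is at least 1.5 percentage points higher than that of the compared algorithms. In terms of running time, the proposed algorithm is slower than Hamm but faster than the other three algorithms based on statistical significance testing. Thus, the proposed algorithm can effectively remove a certain number of false positive and redundant discriminative high utility patterns, offers better mining performance, and runs more efficiently.

Keywords: data mining, discriminative high utility pattern mining, pattern evaluation, false positive pattern filtering, redundant pattern screening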

Abstract:

Aiming at the problems of false positive patterns and redundant patterns in discriminative high utility pattern mining, a discriminative high utility pattern mining algorithm based on unlimited testing and independent growth rate, named UTDHU (Unlimited Testing for Discriminative High Utility pattern mining), was proposed. Firstly, the discriminative high utility patterns that meet the utility and difference thresholds were mined from a target transaction set. Then, a prefix-item shared tree was constructed to calculate the independent growth rate of each pattern quickly, and the redundant patterns whose independent growth rates did not exceed the independence threshold were screened out. Finally, the statistical significance measure p-value of each remaining pattern was calculated by unlimited testing, and the false positive discriminative high utility patterns were filtered out according to the family-wise error rate. Experimental results on four benchmark transaction sets and two synthetic transaction sets show that, compared with Hamm, YBHU (Yekutieli-Benjamini resampling for High Utility pattern mining) and other algorithms, the proposed algorithm outputs the fewest patterns, with at least 97.8% of the tested patterns filtered out. In terms of pattern quality, the proportions of false positive discriminative high utility patterns of the proposed algorithm are less than 5.2%, and the classification accuracies of its constructed features are at least 1.5 percentage points higher than those of the compared algorithms. In terms of running time, although the proposed algorithm is slower than the Hamm algorithm, it is faster than the other three algorithms based on statistical significance testing. It can be seen that the proposed algorithm effectively eliminates a certain number of false positive and redundant discriminative high utility patterns, exhibits superior mining performance, and achieves higher operational efficiency.

Key words: data mining, discriminative high utility pattern mining, pattern assessment, false positive pattern filtering, redundant pattern screening
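
The filtering idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the UTDHU algorithm: the abstract does not define unlimited testing or the independent growth rate, so the sketch omits the redundancy step and swaps in a one-sided Fisher exact test with a Bonferroni correction as a simple stand-in for family-wise error rate control. The toy data, the support-difference measure, and all function names (`utility`, `support_counts`, `fisher_one_sided`, `mine`) are assumptions made for illustration only.

```python
from itertools import combinations
from math import comb

# Hypothetical toy data: each transaction maps item -> utility;
# label 1 marks the target transaction set, label 0 the background set.
TRANSACTIONS = [
    ({"a": 5, "b": 2}, 1),
    ({"a": 4, "b": 3, "c": 1}, 1),
    ({"a": 6, "b": 1}, 1),
    ({"a": 5, "b": 2, "c": 2}, 1),
    ({"b": 2, "c": 4}, 0),
    ({"c": 3}, 0),
    ({"a": 1, "c": 5}, 0),
    ({"b": 1, "c": 2}, 0),
]


def utility(pattern, transactions):
    """Sum the pattern's item utilities over the transactions that contain it."""
    return sum(
        sum(t[i] for i in pattern)
        for t, _ in transactions
        if all(i in t for i in pattern)
    )


def support_counts(pattern, transactions):
    """Return (occurrences in the target set, occurrences in the background set)."""
    hits = [y for t, y in transactions if all(i in t for i in pattern)]
    return hits.count(1), hits.count(0)


def fisher_one_sided(k, n_target, n_pattern, n_total):
    """Hypergeometric tail P(X >= k): at least k target hits among n_pattern covered."""
    upper = min(n_target, n_pattern)
    tail = sum(
        comb(n_target, x) * comb(n_total - n_target, n_pattern - x)
        for x in range(k, upper + 1)
    )
    return tail / comb(n_total, n_pattern)


def mine(transactions, min_util, min_diff, alpha):
    """Stage 1: threshold on utility and support difference (assumed measure).
    Stage 2: keep patterns whose p-value passes a Bonferroni-corrected level."""
    target = [(t, y) for t, y in transactions if y == 1]
    n_total, n_target = len(transactions), len(target)
    n_background = n_total - n_target
    items = sorted({i for t, _ in transactions for i in t})

    candidates = []
    for size in range(1, len(items) + 1):
        for pattern in combinations(items, size):
            s_t, s_b = support_counts(pattern, transactions)
            diff = s_t / n_target - s_b / n_background
            if utility(pattern, target) >= min_util and diff >= min_diff:
                candidates.append((pattern, s_t, s_b))

    # Bonferroni: divide alpha by the number of tested patterns (stand-in for
    # the paper's unlimited testing, which is not specified in the abstract).
    level = alpha / len(candidates) if candidates else alpha
    significant = []
    for pattern, s_t, s_b in candidates:
        p = fisher_one_sided(s_t, n_target, s_t + s_b, n_total)
        if p <= level:
            significant.append((pattern, p))
    return candidates, significant


if __name__ == "__main__":
    cands, sig = mine(TRANSACTIONS, min_util=10, min_diff=0.5, alpha=0.05)
    print("tested:", [p for p, _, _ in cands])
    print("significant:", sig)
```

On this toy data, three patterns pass the utility and difference thresholds, and only `('a', 'b')` also passes the corrected significance level. The brute-force enumeration of all item combinations is only workable for tiny examples; the prefix-item shared tree mentioned in the abstract presumably serves to avoid this cost.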

CLC number: