Journal of Computer Applications

Non-redundant and statistically significant discriminative high utility pattern mining algorithm

  

  • Received: 2024-07-30; Revised: 2024-10-15; Online: 2024-11-19; Published: 2024-11-19

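WU Jun, OUYANG Aijia, WANG Ya

  1. Zunyi Normal University
  • Corresponding author: WU Jun
  • Supported by:
    National Natural Science Foundation of China; Youth Talent Project of the Education Department of Guizhou Province; Science and Technology Support Program of the Science and Technology Department of Guizhou Province; Zunyi Joint Fund Project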

Abstract: To address the false positive and redundant patterns produced in discriminative high utility pattern mining, a novel algorithm named UTDHU (Unlimited Testing for Discriminative High Utility Pattern Mining), based on unlimited testing and the independent growth rate, was designed. First, the patterns that meet the utility and difference thresholds were mined from the target transaction data set. Then, a shared prefix-item tree was built to compute the independent growth rate of each pattern quickly, and redundant patterns that did not exceed the independence threshold were removed. Finally, unlimited testing was used to compute a p-value for each remaining pattern, and false positive patterns were filtered out of the overall result under family-wise error rate control. Experiments were conducted on four benchmark data sets and two synthetic data sets. Compared with Hamm, YBHU (Yekutieli-Benjamini Resampling for High Utility Pattern Mining) and other algorithms, UTDHU output the fewest patterns, filtering out at least 97.8% of the tested patterns. The proportion of false positive patterns in its results was below 5.2%, and the classification accuracy of the features constructed from its patterns was at least 1.5 percentage points higher than that of the compared algorithms. In running time, UTDHU was slower than Hamm but faster than the other three algorithms based on statistical significance testing. The experimental results show that the proposed algorithm effectively removes false positive and redundant discriminative high utility patterns and achieves better mining performance and efficiency.
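The three stages named in the abstract (threshold filtering, redundancy removal, significance filtering) can be illustrated with a minimal Python sketch. It is not the UTDHU implementation: the abstract does not define the independent growth rate or the unlimited testing procedure, so the sketch substitutes a simple sub-pattern dominance rule, a one-sided Fisher exact test and a Bonferroni correction as stand-ins, and it enumerates patterns by brute force instead of using a shared prefix-item tree. All function names, the definition of "difference" as a gap in relative support, and the toy data are assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern.issubset(t))


def utility(pattern, transactions):
    """Total utility of the pattern's items over the transactions containing it."""
    return sum(sum(t[i] for i in pattern)
               for t in transactions if pattern.issubset(t))


def fisher_p(a, n1, b, n2):
    """One-sided Fisher exact p-value P(X >= a) under the hypergeometric null."""
    k_total, n = a + b, n1 + n2
    tail = sum(comb(k_total, k) * comb(n - k_total, n1 - k)
               for k in range(a, min(k_total, n1) + 1))
    return tail / comb(n, n1)


def mine(target, background, min_util, min_diff, alpha=0.05, max_len=2):
    items = sorted({i for t in target for i in t})
    n1, n2 = len(target), len(background)

    # Stage 1: keep patterns meeting the utility and difference thresholds.
    # "Difference" is ASSUMED here to be the gap in relative support.
    cands = {}
    for size in range(1, max_len + 1):
        for combo in combinations(items, size):
            p = frozenset(combo)
            a, b = support(p, target), support(p, background)
            diff = a / n1 - b / n2
            if utility(p, target) >= min_util and diff >= min_diff:
                cands[p] = (a, b, diff)

    # Stage 2 (STAND-IN for the independent growth rate): drop a pattern
    # if some proper sub-pattern is already at least as discriminative.
    kept = {p: v for p, v in cands.items()
            if not any(q < p and cands[q][2] >= v[2] for q in cands)}

    # Stage 3 (STAND-IN for unlimited testing): Fisher exact test with a
    # Bonferroni-corrected level, which controls the family-wise error rate.
    m = len(kept)
    return sorted(tuple(sorted(p)) for p, (a, b, _) in kept.items()
                  if fisher_p(a, n1, b, n2) <= alpha / m)


# Hypothetical toy data: each transaction maps item -> utility.
target = [{"a": 2, "b": 3} for _ in range(8)]
background = ([{"a": 2, "c": 1} for _ in range(4)]
              + [{"b": 3, "c": 1} for _ in range(4)])

print(mine(target, background, min_util=10, min_diff=0.4))  # [('a', 'b')]
```

In this toy run the single items `a` and `b` pass both thresholds but are rejected at the corrected significance level (p = 495/12870, about 0.038, against 0.05/3), while the pair `{a, b}` survives. This mirrors, on a very small scale, how a significance filter can discard patterns that thresholds alone would report.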

Key words: data mining, discriminative high utility pattern mining, pattern assessment, false positive pattern filtering, redundant pattern removal
