Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (7): 1921-1926.DOI: 10.11772/j.issn.1001-9081.2015.07.1921

Previous Articles     Next Articles

Frequent pattern mining algorithm from uncertain data based on pattern-growth

WANG Le, CHANG Yanfeng, WANG Shui   

  1. School of Information Engineering, Ningbo Dahongying University, Ningbo Zhejiang 315175, China
  • Received:2015-02-02 Revised:2015-04-02 Online:2015-07-10 Published:2015-07-17

基于模式增长的不确定数据的频繁模式挖掘算法

王乐, 常艳芬, 王水   

  1. 宁波大红鹰学院 信息工程学院, 浙江 宁波 315175
  • 通讯作者: 王水(1967-),男,河南南阳人,教授,主要研究方向:数据挖掘、金融数据分析,seawan@163.com
  • 作者简介:王乐(1978-),女,河南南阳人,讲师,博士,主要研究方向:数据挖掘; 常艳芬(1981-),女,河北保定人,讲师,硕士,主要研究方向:数据挖掘、软件工程
  • 基金资助:

    国家自然科学基金资助项目(61370200);宁波市自然科学基金资助项目(2013A610115, 2014A610073);浙江省教育厅一般科研项目(Y201432717);宁波大红鹰学院大宗商品专项课题(1320133004)。

Abstract:

To improve the time and space efficiency of Frequent Pattern (FP) mining algorithm over uncertain dataset, the Uncertain Frequent Pattern Mining based on Max Probability (UFPM-MP) algorithm was proposed. First, the expected support number was estimated using maximum probability of the transaction itemset. Second, by comparing this expected support number to the minimum expected support number threshold, the candidate frequent itemsets were identified. Finally, the corresponding sub-trees were built for recursively mining frequent patterns. The UFPM-MP algorithm was tested on 6 classical datasets against the state-of-the-art algorithm AT (Array based tail node Tree structure)-Mine with positive results (about 30% improvement for sparse datasets, and 3-4 times more efficient for dense datasets). The expected support number estimation strategy effectively reduces the number of sub-trees and items of header table, and improves the algorithm's time and space efficiency; and when the minimum expected support threshold is low or there are lots of potential frequent patterns, time efficiency of the proposed algorithm performs more remarkably.

Key words: uncertain data, Frequent Pattern (FP), frequent itemset, pattern-growth

摘要:

为提高不确定数据频繁模式(FP)挖掘算法的时空效率,提出了基于最大概率的不确定频繁模式挖掘(UFPM-MP)算法。首先,利用事务项集中的最大概率值预估期望支持数;然后,使用该期望支持数与最小期望支持数阈值进行比较,以确定某一项集是否为候选频繁项集,并对候选项集建立子树以递归挖掘频繁模式。实验中,UFPM-MP算法与AT-Mine算法进行了对比,并在6个典型的数据集上进行实验验证。实验结果表明,UFPM-MP算法的时空效率得到了提高,稀疏数据集上提高约30%,稠密数据集上的效率提高更为明显(约3~4倍)。预估期望支持数的策略有效地减少了子树和头表项的数量,从而提高了算法的时空效率;且最小期望支持数越小,或需要挖掘的频繁模式越多的时候,算法的时间效率提高越多。

关键词: 不确定数据, 频繁模式, 频繁项集, 模式增长

CLC Number: