Frequent pattern mining algorithm from uncertain data based on pattern-growth

doi:10.11772/j.issn.1001-9081.2015.07.1921

Abstract:

To improve the time and space efficiency of Frequent Pattern (FP) mining over uncertain datasets, the Uncertain Frequent Pattern Mining based on Max Probability (UFPM-MP) algorithm is proposed. First, the expected support number of an itemset is estimated using the maximum probability in the transaction itemset. Second, this estimate is compared with the minimum expected support number threshold to decide whether the itemset is a candidate frequent itemset. Finally, sub-trees are built for the candidate itemsets, and frequent patterns are mined from them recursively. UFPM-MP was compared with the state-of-the-art algorithm AT-Mine (Array based Tail node Tree structure) on 6 classical datasets: it was about 30% more efficient on sparse datasets and about 3 to 4 times more efficient on dense datasets. The estimation strategy effectively reduces the number of sub-trees and header-table items, which improves both time and space efficiency; the lower the minimum expected support threshold, or the more frequent patterns there are to mine, the larger the gain in time efficiency.
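The estimate-then-compare step can be illustrated with a minimal sketch. It is one plausible reading of the idea, not the authors' implementation: the toy dataset and the function names `expected_support`, `max_prob_estimate`, and `is_candidate` are hypothetical, and the exact estimation formula used by UFPM-MP is not given in the abstract. The sketch assumes the standard definition of expected support (the sum, over transactions, of the product of the item probabilities) and over-estimates it by replacing the new item's probability with the largest probability among the non-prefix items of each transaction.

```python
from math import prod

# Hypothetical toy uncertain dataset: each transaction maps an item
# to its existence probability in that transaction.
TRANSACTIONS = [
    {"a": 0.9, "b": 0.8, "c": 0.5},
    {"a": 0.6, "c": 0.7},
    {"b": 0.4, "c": 0.9},
]


def expected_support(transactions, itemset):
    """Exact expected support: sum over the transactions that contain
    every item of the product of the item probabilities."""
    return sum(
        prod(t[i] for i in itemset)
        for t in transactions
        if all(i in t for i in itemset)
    )


def max_prob_estimate(transactions, prefix, item):
    """Over-estimate of the expected support of prefix + item.

    Assumption (not taken from the paper): the probability of `item`
    is replaced by the maximum probability among the items of the
    transaction that are not in `prefix`, so the result can never be
    smaller than the exact expected support. `item` must not be in
    `prefix`.
    """
    total = 0.0
    for t in transactions:
        if item in t and all(i in t for i in prefix):
            rest = [p for i, p in t.items() if i not in prefix]
            total += prod(t[i] for i in prefix) * max(rest)
    return total


def is_candidate(transactions, prefix, item, min_exp_sup):
    """Keep prefix + item as a candidate only if the over-estimate
    reaches the minimum expected support threshold."""
    return max_prob_estimate(transactions, prefix, item) >= min_exp_sup
```

Because the estimate is an upper bound, an itemset that fails the check cannot be frequent and needs no sub-tree, while an itemset that passes is only a candidate whose exact expected support still has to be verified.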

Key words: uncertain data, Frequent Pattern (FP), frequent itemset, pattern-growth

CLC Number:

TP311.13

WANG Le, CHANG Yanfeng, WANG Shui. Frequent pattern mining algorithm from uncertain data based on pattern-growth[J]. Journal of Computer Applications, 2015, 35(7): 1921-1926.

王乐, 常艳芬, 王水. 基于模式增长的不确定数据的频繁模式挖掘算法[J]. 计算机应用, 2015, 35(7): 1921-1926.
