Abstract: Frequent Itemset Mining (FIM) is one of the most important data mining tasks, and the characteristics of the mined dataset have a significant effect on the performance of FIM algorithms. Sparseness is one of the essential attributes that characterize a dataset, and different types of FIM algorithms differ greatly in how well they scale with dataset sparseness. To measure the sparseness of datasets and to study its influence on the performance of different types of FIM algorithms, existing measurement methods were first reviewed and discussed, and two methods were then proposed to quantify dataset sparseness: a measurement based on transaction differences and a measurement based on the FP-Tree. Both methods take into account the influence of the minimum support on dataset sparseness in the context of the FIM task, and reflect the differences between the frequent itemsets of transactions. The scalability of different types of FIM algorithms with respect to dataset sparseness was then studied experimentally. The experimental results show that the sparseness of a dataset is inversely proportional to the minimum support, and that among the three typical kinds of FIM algorithms, the one based on the vertical data format has the best scalability.
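The abstract does not give the paper's exact formulas, so the following is only a minimal illustrative sketch of what a transaction-difference style sparseness measure that accounts for the minimum support might look like: transactions are first projected onto the frequent items, and the average normalized symmetric difference between the projected transactions is taken as the sparseness score. All names (`frequent_items`, `transaction_difference_sparseness`) and the specific formula are assumptions for illustration, not the paper's published definitions.

```python
# Hypothetical sketch (not the paper's published measure): sparseness as the
# average normalized symmetric difference between frequent-item projections
# of transactions, so the score depends on the chosen minimum support.
from itertools import combinations


def frequent_items(transactions, min_support):
    """Return the items whose support meets min_support (given as a fraction)."""
    counts = {}
    for t in transactions:
        for item in set(t):
            counts[item] = counts.get(item, 0) + 1
    threshold = min_support * len(transactions)
    return {item for item, c in counts.items() if c >= threshold}


def transaction_difference_sparseness(transactions, min_support):
    """Average pairwise symmetric difference (normalized by the union size)
    between the frequent-item projections of the transactions."""
    freq = frequent_items(transactions, min_support)
    projected = [set(t) & freq for t in transactions]
    diffs = []
    for a, b in combinations(projected, 2):
        union = a | b
        if union:
            diffs.append(len(a ^ b) / len(union))
    return sum(diffs) / len(diffs) if diffs else 0.0


if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "c", "d"}, {"b", "e"}, {"a", "b", "c", "e"}]
    print(transaction_difference_sparseness(db, min_support=0.5))
```

An FP-Tree-based measure could analogously be sketched from the tree's compression behaviour (for example, comparing the number of tree nodes with the total number of frequent-item occurrences, since denser datasets share longer prefixes), but the paper itself should be consulted for the actual definitions and experimental settings.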
XIAO Wen, HU Juan. Performance analysis of frequent itemset mining algorithms based on sparseness of dataset. Journal of Computer Applications, 2018, 38(4): 995-1000.