Performance analysis of frequent itemset mining algorithms based on sparseness of dataset

doi:10.11772/j.issn.1001-9081.2017092389

Journal of Computer Applications, 2018, Vol. 38, Issue (4): 995-1000.

XIAO Wen, HU Juan

Department of Electrical and Information Engineering, Hohai University Wentian College, Maanshan Anhui 243031, China

Received: 2017-10-13 Revised: 2017-11-06 Online: 2018-04-09 Published: 2018-04-10
Corresponding author: XIAO Wen
About the authors: XIAO Wen (born 1984), male, from Huangshan, Anhui; M.S., lecturer; main research interests: distributed computing, data mining. HU Juan (born 1985), female, from Haimen, Jiangsu; M.S., lecturer; main research interests: software engineering, database systems.
Supported by:
This work is partially supported by the Natural Science Foundation of the Colleges and Universities in Anhui Province (KJ2016A623).

Abstract

Abstract: Frequent Itemset Mining (FIM) is one of the most fundamental data mining tasks, and the characteristics of the mined dataset have a significant effect on the performance of FIM algorithms. Sparseness is one of the attributes that reflect the essential nature of a dataset, and different types of FIM algorithms differ greatly in how well they scale with dataset sparseness. To address how to quantify dataset sparseness and how sparseness affects the performance of different types of FIM algorithms, the existing measurement methods were first reviewed and discussed, and then two new methods were proposed to quantify the sparseness of a dataset: a measurement based on transaction difference and a measurement based on FP-Tree. Both methods consider the influence of the minimum support on dataset sparseness in the context of the FIM task, and reflect the degree of difference between the frequent itemsets of the transactions. Finally, the scalability of different types of FIM algorithms with respect to dataset sparseness was verified experimentally. The experimental results show that dataset sparseness is inversely proportional to the minimum support, and that FIM algorithms based on the vertical format have the best sparseness scalability among the three typical types of FIM algorithms.

Key words: data mining, Frequent Itemset Mining (FIM), sparseness, scalability
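
To make concrete the idea that sparseness depends on the minimum support, the following is a minimal sketch in Python. It is not either of the two measures proposed in the paper, whose formulas are not given on this page. It uses a generic density measure as a stand-in: infrequent items are dropped at a given relative minimum support, and sparseness is taken as one minus the fill ratio of the remaining transaction-by-item matrix. The function name `frequent_item_density` and the toy transactions are hypothetical.

```python
from collections import Counter


def frequent_item_density(transactions, min_support):
    """Return (density, sparseness) after dropping infrequent items.

    Assumption: this is a generic fill-ratio measure used only for
    illustration, not the transaction-difference or FP-Tree measure
    from the paper. min_support is a relative threshold in (0, 1].
    """
    n = len(transactions)
    # Count each item at most once per transaction.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c / n >= min_support}
    if not frequent:
        # No frequent items left: treat the dataset as fully sparse.
        return 0.0, 1.0
    # Number of filled cells in the transaction-by-frequent-item matrix.
    filled = sum(len(set(t) & frequent) for t in transactions)
    density = filled / (n * len(frequent))
    return density, 1.0 - density


if __name__ == "__main__":
    toy = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["a"]]
    for s in (0.25, 0.5, 1.0):
        print(s, frequent_item_density(toy, s))
    # 0.25 -> (0.5, 0.5); 0.5 -> (0.75, 0.25); 1.0 -> (1.0, 0.0)
```

On this toy data the measured sparseness falls as the minimum support rises, because only the more frequent items survive the filter. That is the same direction as the relationship reported in the abstract, although the paper's own measures may behave differently in detail.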

CLC number: TP311.5

XIAO Wen, HU Juan. Performance analysis of frequent itemset mining algorithms based on sparseness of dataset[J]. Journal of Computer Applications, 2018, 38(4): 995-1000.
