Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (4): 995-1000.DOI: 10.11772/j.issn.1001-9081.2017092389

Previous Articles     Next Articles

Performance analysis of frequent itemset mining algorithms based on sparseness of dataset

XIAO Wen, HU Juan   

  1. Department of Electrical and Information Engineering, Hohai University Wentian College, Maanshan Anhui 243031, China
  • Received:2017-10-13 Revised:2017-11-06 Online:2018-04-10 Published:2018-04-09
  • Supported by:
    This work is partially supported by the Natural Science Foundation of the Colleges and Universities in Anhui Province (KJ2016A623).

基于数据集稀疏度的频繁项集挖掘算法性能分析

肖文, 胡娟   

  1. 河海大学文天学院 电气信息工程系, 安徽 马鞍山 243031
  • 通讯作者: 肖文
  • 作者简介:肖文(1984-),男,安徽黄山人,讲师,硕士研究生,主要研究方向:分布式计算、数据挖掘;胡娟(1985-),女,江苏海门人,讲师,硕士研究生,主要研究方向:软件工程、数据库系统。
  • 基金资助:
    安徽省高校自然科学研究项目(KJ2016A623)。

Abstract: Frequent Itemset Mining (FIM) is one of the most important data mining tasks. The characteristics of the mined datasets have a significant effect on the performance of FIM algorithms. Sparseness of datasets is one of the attributes that characterize the essential characteristics of datasets. Different types of FIM algorithms are very different in the scalability of dataset sparseness. Aiming at the measurement of sparseness of datasets and influence of sparsity on the performance of different types of FIM algorithms, the existing measurement methods were reviewed and discussed, then two methods were proposed to quantify the sparseness of the datasets:the measurement based on transaction difference and the measurement based on FP-Tree method, both of which considered the influence of the minimum support degree on the sparseness of the datasets in the background of the FIM task, and reflected the difference between the frequent itemsets of the transaction. The scalability of different types of FIM algorithms for sparseness of datasets was studied experimentally. The experimental results show that the sparseness of datasets is inversely proportional to the minimum support, and the FIM algorithm based on vertical format has the best scalability in three kinds of typical FIM algorithms.

Key words: data mining, Frequent Itemset Mining (FIM), sparseness, scalability

摘要: 频繁项集挖掘(FIM)是最基础的数据挖掘任务之一,被挖掘数据集的特征对FIM算法的性能有着显著影响。数据集稀疏度是体现数据集本质特征的属性之一,不同类型的FIM算法对数据集稀疏度的可扩展性有着很大的不同。针对如何量化度量数据集稀疏度及稀疏度对不同类型FIM算法性能影响等问题,首先回顾并讨论了已有的度量方法,然后提出两种新的量化度量数据集稀疏度的方法(基于事务差异度的度量方法和基于FP-Tree的度量方法)。这两种度量方法均考虑了FIM任务背景下最小支持度对数据集稀疏度的影响,反映的是事务频繁项集之间的差异度。最后通过实验验证了不同类型FIM算法对数据集稀疏度的可扩展性。实验结果表明,数据集稀疏度与最小支持度成反比,基于垂直格式的FIM算法在三类典型FIM算法中具有最佳的稀疏度可扩展性。

关键词: 数据挖掘, 频繁项集挖掘, 稀疏度, 可扩展性

CLC Number: