计算机应用 ›› 2013, Vol. 33 ›› Issue (11): 3045-3048.

• 数据库技术 • 上一篇    下一篇

基于散列的频繁项集分组算法

王红梅1,2,胡明2   

  1. 1. 吉林大学
    2. 吉林大学 计算机科学与技术学院,长春 130012
  • 收稿日期:2013-05-24 修回日期:2013-07-19 出版日期:2013-11-01 发布日期:2013-12-04
  • 通讯作者: 胡明
  • 作者简介:王红梅(1968-),女,吉林长春人,教授,博士研究生,主要研究方向:数据挖掘、智能算法;胡明(1963-),男,江苏句容人,教授,主要研究方向:数据挖掘、分布式计算。
  • 基金资助:
    国家自然科学基金资助项目;吉林省自然科学基金资助项目

Frequent itemsets grouping algorithm based on Hash list

Wang Hongmei1,2,HU Ming1   

  1. 1. College of Computer Science and Technology, Jilin University, Changchun Jilin 130012, China;
    2. School of Computer Science and Engineering, Changchun University of Technology, Changchun Jilin 130012, China
  • Received:2013-05-24 Revised:2013-07-19 Online:2013-12-04 Published:2013-11-01
  • Contact: HU Ming
  • Supported by:
    ;Natural Science Foundation Project of Province Jilin

摘要: Apriori算法是频繁项集挖掘的经典算法。针对Apriori算法的剪枝操作和多次扫描数据集的缺点,提出了基于散列的频繁项集分组(HFG)算法。证明了2-项集剪枝性质,采用散列技术存储频繁2-项集,将Apriori算法剪枝操作的时间复杂度从O(k×|Lk|)降低到O(1);定义了首项的子项集概念,将数据集划分为以Ii为首项的数据子集并采用分组索引表存储,在求以Ii为首项的频繁项集时,只扫描以Ii为首项的数据子集,减少了对数据集扫描的时间代价。实验结果表明,由于HFG算法的剪枝操作产生了累积效益,以及分组扫描排除了无效的项集和元组,使得HFG算法在时间性能方面与Apriori算法相比有较大提高。

关键词: 频繁项集, 2-项集剪枝, 散列表, 首项分组, 索引表

Abstract: Apriori algorithm is a classic algorithm in frequent itemsets mining. In view of the shortcomings of pruning operations and multiply scanning data set in Apriori, a Hash-based Frequent itemsets Grouping algorithm, named HFG was put forward. In this paper, 2-length itemsets pruning property was proved, frequent 2-length itemsets were stored based on Hash list, the time complexity of Apriori algorithm in pruning operation was dropped from O(k×|Lk|) to O(1); the concept of sub-itemset of first term was defined, dataset was divided into subsets with Ii as first item and stored by grouping index list. Therefore, only the sub data set with Ii as the first item was scanned to find the frequent itemsets, and the time cost of scanning dataset was reduced. The experimental results show that the HFG algorithm is much more efficient than Apriori algorithm in time performance because of the cumulative benefits of pruning operations and skipping the invalid itemsets and records in HFG algorithm.

Key words: frequent itemset, 2-length itemset pruning, Hash list, first term grouping, index list

中图分类号: