基于散列的频繁项集分组算法

计算机应用 ›› 2013, Vol. 33 ›› Issue (11): 3045-3048.

基于散列的频繁项集分组算法

王红梅¹,²,胡明²

1. 吉林大学
2. 吉林大学计算机科学与技术学院，长春 130012

收稿日期:2013-05-24 修回日期:2013-07-19 出版日期:2013-11-01 发布日期:2013-12-04
通讯作者: 胡明
作者简介:王红梅（1968-），女，吉林长春人，教授，博士研究生，主要研究方向：数据挖掘、智能算法；胡明（1963-），男，江苏句容人，教授，主要研究方向：数据挖掘、分布式计算。
基金资助:
国家自然科学基金资助项目;吉林省自然科学基金资助项目

Frequent itemsets grouping algorithm based on Hash list

Wang Hongmei¹,²,HU Ming¹

1. College of Computer Science and Technology, Jilin University, Changchun Jilin 130012, China;
2. School of Computer Science and Engineering, Changchun University of Technology, Changchun Jilin 130012, China

Received:2013-05-24 Revised:2013-07-19 Online:2013-12-04 Published:2013-11-01
Contact: HU Ming
Supported by:
;Natural Science Foundation Project of Province Jilin

摘要/Abstract

摘要： Apriori算法是频繁项集挖掘的经典算法。针对Apriori算法的剪枝操作和多次扫描数据集的缺点，提出了基于散列的频繁项集分组(HFG)算法。证明了2-项集剪枝性质，采用散列技术存储频繁2-项集，将Apriori算法剪枝操作的时间复杂度从O(k×|Lk|)降低到O(1)；定义了首项的子项集概念，将数据集划分为以Ii为首项的数据子集并采用分组索引表存储，在求以Ii为首项的频繁项集时，只扫描以Ii为首项的数据子集，减少了对数据集扫描的时间代价。实验结果表明，由于HFG算法的剪枝操作产生了累积效益，以及分组扫描排除了无效的项集和元组，使得HFG算法在时间性能方面与Apriori算法相比有较大提高。

关键词: 频繁项集, 2-项集剪枝, 散列表, 首项分组, 索引表

Abstract: Apriori algorithm is a classic algorithm in frequent itemsets mining. In view of the shortcomings of pruning operations and multiply scanning data set in Apriori, a Hash-based Frequent itemsets Grouping algorithm, named HFG was put forward. In this paper, 2-length itemsets pruning property was proved, frequent 2-length itemsets were stored based on Hash list, the time complexity of Apriori algorithm in pruning operation was dropped from O(k×|Lk|) to O(1); the concept of sub-itemset of first term was defined, dataset was divided into subsets with Ii as first item and stored by grouping index list. Therefore, only the sub data set with Ii as the first item was scanned to find the frequent itemsets, and the time cost of scanning dataset was reduced. The experimental results show that the HFG algorithm is much more efficient than Apriori algorithm in time performance because of the cumulative benefits of pruning operations and skipping the invalid itemsets and records in HFG algorithm.

Key words: frequent itemset, 2-length itemset pruning, Hash list, first term grouping, index list

中图分类号:

TP311

王红梅胡明. 基于散列的频繁项集分组算法[J]. 计算机应用, 2013, 33(11): 3045-3048.

Wang Hongmei HU Ming. Frequent itemsets grouping algorithm based on Hash list[J]. Journal of Computer Applications, 2013, 33(11): 3045-3048.

[1]	于永斌, 戚敏惠, 尼玛扎西, 王琳. 基于阈值自适应忆阻器Hopfield神经网络的关联规则挖掘算法[J]. 计算机应用, 2019, 39(3): 728-733.
[2]	杜媛, 张世伟. 基于重构的改进自然排序树算法[J]. 计算机应用, 2019, 39(2): 441-445.
[3]	肖文, 胡娟. 基于数据集稀疏度的频繁项集挖掘算法性能分析[J]. 计算机应用, 2018, 38(4): 995-1000.
[4]	尹远, 张昌, 文凯, 郑云俊. 基于DiffNodeset结构的最大频繁项集挖掘算法[J]. 计算机应用, 2018, 38(12): 3438-3443.
[5]	顾军华, 武君艳, 许馨匀, 谢志坚, 张素琪. 基于Spark的并行FP-Growth算法优化及实现[J]. 计算机应用, 2018, 38(11): 3069-3074.
[6]	李校林, 杜托, 刘彪. 基于B-list的快速频繁模式挖掘算法[J]. 计算机应用, 2017, 37(8): 2357-2361.
[7]	王鑫, 刘方爱. 改进的多数据流协同频繁项集挖掘算法[J]. 计算机应用, 2016, 36(7): 1988-1992.
[8]	徐永秀, 刘旭敏, 徐维祥. 基于间隔链表改进的频繁项集挖掘算法[J]. 计算机应用, 2016, 36(4): 997-1001.
[9]	王乐, 常艳芬, 王水. 基于模式增长的不确定数据的频繁模式挖掘算法[J]. 计算机应用, 2015, 35(7): 1921-1926.
[10]	刘浩然, 刘方爱, 李旭, 王记伟. 有效的不确定数据概率频繁项集挖掘算法[J]. 计算机应用, 2015, 35(6): 1757-1761.
[11]	杨鹏坤, 彭慧, 周晓锋, 孙玉庆. 改进的基于频繁模式树的最大频繁项集挖掘算法——FP-MFIA[J]. 计算机应用, 2015, 35(3): 775-778.
[12]	刘慧婷, 沈盛霞, 赵鹏, 姚晟. 不确定数据频繁闭项集挖掘算法[J]. 计算机应用, 2015, 35(10): 2911-2914.
[13]	周秀梅黄名选. 有效的矩阵加权正负关联规则挖掘算法——MWARM-SRCCCI[J]. 计算机应用, 2014, 34(10): 2820-2826.
[14]	林瑞钟诚华蓓. 隐私保护的一站多表跨多表频繁项集挖掘[J]. 计算机应用, 2013, 33(12): 3437-3440.
[15]	段磊唐常杰 Guozhu Dong 杨宁苟驰. 基于显露模式的对比挖掘研究及应用进展[J]. 计算机应用, 2012, 32(02): 304-308.