Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (12): 3438-3443.DOI: 10.11772/j.issn.1001-9081.2018040913

Previous Articles     Next Articles

Maximal frequent itemset mining algorithm based on DiffNodeset structure

YIN Yuan1,2, ZHANG Chang1, WEN Kai1,3, ZHENG Yunjun1   

  1. 1. Institute of Applied Communication Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China;
    2. Chongqing Branch, China Telecom Company Limited, Chongqing 401121, China;
    3. Chongqing Information Technology Designing Company Limited, Chongqing 401121, China
  • Received:2018-04-23 Revised:2018-07-20 Online:2018-12-10 Published:2018-12-15
  • Contact: 张昌

基于DiffNodeset结构的最大频繁项集挖掘算法

尹远1,2, 张昌1, 文凯1,3, 郑云俊1   

  1. 1. 重庆邮电大学 通信新技术应用研究中心, 重庆 400065;
    2. 中国电信股份有限公司 重庆分公司, 重庆 401121;
    3. 重庆信科设计有限公司, 重庆 401121
  • 通讯作者: 张昌
  • 作者简介:尹远(1963-),男,重庆人,高级工程师,硕士,主要研究方向:移动通信、大数据;张昌(1993-),男,湖北孝感人,硕士研究生,主要研究方向:数据挖掘、大数据、云计算;文凯(1972-),男,重庆人,正高级工程师,博士,主要研究方向:移动通信、大数据;郑云俊(1992-),男,重庆人,硕士研究生,主要研究方向:大数据、推荐系统。

Abstract: In data mining, mining maximum frequent itemsets instead of mining frequent itemsets can greatly improve the operating efficiency of system. The running time consumption of existing maximum frequent itemset mining algorithms is still very large. In order to solve the problem, a new DiffNodeset Maxmal Frequent Itemset Mining (DNMFIM) algorithm was proposed. Firstly, a new data structure DiffNodeset was adopted to realize the fast calculation of intersection and support degree. Secondly, the connection method with linear time complexity was adopted to reduce the complexity of connecting two DiffNodesets and avoid multiple invalid calculations. Then, the set-enumeration tree was adopted as the search space, and a variety of optimal pruning strategies were used to reduce the search space. Finally, the superset detection technology used in the MAximal Frequent Itemset Algorithm (MAFIA) algorithm was adopted to improve the accuracy of algorithm effectively. The experimental results show that, DNMFIM algorithm outperforms MAFIA and N-list based MAFIA (NB-MAFIA) in terms of time efficiency. The proposed algorithm has a good performance when mining the maximal frequent itemsets in different types of datasets.

Key words: maximal frequent itemset mining, association rule, set-enumeration tree, optimized pruning, superset detection

摘要: 在数据挖掘中,通过挖掘最大频繁项集来代替挖掘频繁项集可以大大地提升系统的运行效率。针对现有的最大频繁项集挖掘算法的运行时间消耗仍然很大的问题,提出了一种基于DiffNodeset结构的最大频繁项集挖掘(DNMFIM)算法。首先,采用了一种新的数据结构DiffNodeset来实现求交集以及支持度的快速计算;其次,引入一种新的线性复杂度的连接方法来降低两个DiffNodeset在连接过程中的复杂度,避免了多次的无效计算;然后,将集合枚举树作为搜索空间,同时采用多种优化剪枝策略来缩小搜索空间;最后,再结合最大频繁项集挖掘算法(MAFIA)中所使用的超集检测技术来有效地提高算法的准确性。实验结果表明,DNMFIM算法在时间效率方面性能优于MAFIA与基于N-list的MAFIA(NB-MAFIA),该算法在不同类型数据集中进行最大频繁项集挖掘时均有良好的效果。

关键词: 最大频繁项集挖掘, 关联规则, 集合枚举树, 优化剪枝, 超集检测

CLC Number: