Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (6): 1757-1761.DOI: 10.11772/j.issn.1001-9081.2015.06.1757

Previous Articles     Next Articles

Efficient mining algorithm for uncertain data in probabilistic frequent itemsets

LIU Haoran, LIU Fang'ai, LI Xu, WANG Jiwei   

  1. College of Information Science and Engineering, Shandong Normal University, Jinan Shandong 250014, China
  • Received:2014-12-22 Revised:2015-03-23 Published:2015-06-12

有效的不确定数据概率频繁项集挖掘算法

刘浩然, 刘方爱, 李旭, 王记伟   

  1. 山东师范大学 信息科学与工程学院, 济南 250014
  • 通讯作者: 刘浩然(1989-),男,山东德州人,硕士研究生,CCF会员,主要研究方向:数据挖掘、大数据分析;sdnuhaoran0420@163.com
  • 作者简介:刘方爱(1962-),男,山东青岛人,教授,博士生导师,主要研究方向:无线网络、分布式计算;李旭(1990-),女,山东德州人,硕士研究生,主要研究方向:大数据分析;王记伟(1988-),男,山东日照人,硕士研究生,主要研究方向:社交网络、大数据分析。
  • 基金资助:

    国家自然科学基金资助项目(90612003);山东省自然科学基金资助项目(ZR2013FM008);山东省科技发展计划项目(2011GGH20123)。

Abstract:

When using the way of pattern growth to construct tree structure, the exiting algorithms for mining probabilistic frequent itemsets suffer many problems, such as generating large number of tree nodes, occupying large memory space and having low efficiency. In order to solve these problems, a Progressive Uncertain Frequent Pattern Growth algorithm named PUFP-Growth was proposed. By the way of reading data in the uncertain database tuple by tuple, the proposed algorithm constructed tree structure as compact as Frequent Pattern Tree (FP-Tree) and updated dynamic array of expected value whose header table saved the same itemsets. When all transactions were inserted into the Progressive Uncertain Frequent Pattern tree (PUFP-Tree), all the probabilistic frequent itemsets could be mined by traversing the dynamic array. The experimental results and theoretical analysis show that PUFP-Growth algorithm can find the probabilistic frequent itemsets effectively. Compared with the Uncertain Frequent pattern Growth (UF-Growth) algorithm and Compressed Uncertain Frequent-Pattern Mine (CUFP-Mine) algorithm, the proposed PUFP-Growth algorithm can improve mining efficiency of probabilistic frequent itemsets on uncertain dataset and reduce memory usage to a certain degree.

Key words: data mining, uncertain data, possible world model, probabilistic frequent itemset, frequent pattern

摘要:

针对已有概率频繁项集挖掘算法采用模式增长的方式构建树时产生大量树节点,导致内存空间占用较大以及发现概率频繁项集效率低等问题,提出了改进的不确定数据频繁模式增长(PUFP-Growth)算法。该算法通过逐条读取不确定事务数据库中数据,构造类似频繁模式树(FP-Tree)的紧凑树结构,同时更新项头表中保存所有尾节点相同项集的期望值的动态数组。当所有事务数据插入到改进的不确定数据频繁模式树(PUFP-Tree)中以后,通过遍历数组得到所有的概率频繁项集。最后通过实验结果和理论分析表明:PUFP-Growth算法可以有效地发现概率频繁项集;与不确定数据频繁模式增长(UF-Growth)算法和压缩的不确定频繁模式挖掘(CUFP-Mine)算法相比,提出的PUFP-Growth算法能够提高不确定数据概率频繁项集挖掘的效率,并且减少了内存空间的使用。

关键词: 数据挖掘, 不确定数据, 可能世界模型, 概率频繁项集, 频繁模式

CLC Number: