Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (10): 2911-2914.DOI: 10.11772/j.issn.1001-9081.2015.10.2911

Previous Articles     Next Articles

Frequent closed itemset mining algorithm over uncertain data

LIU Huiting, SHEN Shengxia, ZHAO Peng, YAO Sheng   

  1. College of Computer Science and Technology, Anhui University, Hefei Anhui 230601, China
  • Received:2015-05-08 Revised:2015-08-07 Online:2015-10-10 Published:2015-10-14

不确定数据频繁闭项集挖掘算法

刘慧婷, 沈盛霞, 赵鹏, 姚晟   

  1. 安徽大学 计算机科学与技术学院, 合肥 230601
  • 通讯作者: 沈盛霞(1992-),女,安徽马鞍山人,硕士研究生,主要研究方向:数据挖掘,sxshen@iflytek.com
  • 作者简介:刘慧婷(1978-),女,安徽阜阳人,副教授,博士,主要研究方向:数据挖掘、推荐系统;赵鹏(1976-),女,安徽合肥人,副教授,博士,主要研究方向:信息检索、机器学习;姚晟(1979-),女,安徽安庆人,讲师,博士,主要研究:数据挖掘。
  • 基金资助:
    国家自然科学基金资助项目(61202227);安徽省自然科学基金资助项目(1408085MF122)。

Abstract: Due to the downward closure property over uncertain data, existing solutions of mining all the frequent itemsets may lead an exponential number of results. In order to obtain a reasonable result set with small size, frequent closed itemsets discovering over uncertain data were studied, and a new algorithm called Normal Approximation-based Probabilistic Frequent Closed Itemsets Mining (NA-PFCIM) was proposed. The new method regarded the itemset mining process as a probability distribution function, and mined frequent itemsets by using the normal distribution model which supports large databases and can extract frequent itemsets with a high degree of accuracy. Then, the algorithm adopted the depth-first search strategy to obtain all probabilistic frequent closed itemsets, so as to reduce the search space and avoid redundant computation. Two probabilistic pruning techniques including superset pruning and subset pruning were also used in this method. Finally, the effectiveness and efficiency of the proposed methods were verified by comparing with the Possion distribution based algorithm called A-PFCIM. The experimental results show that NA-PFCIM can decrease the number of extending itemsets and reduce the complexity of calculation, it has better performance than the compared algorithm.

Key words: uncertain data, frequent itemset, frequent closed itemset, pruning strategy, normal distribution

摘要: 由于不确定数据的向下封闭属性,挖掘全部频繁项集的方法会得到一个指数级的结果。为获得一个较小的合适的结果集,研究了在不确定数据上挖掘频繁闭项集,并提出了一种新的频繁闭项集挖掘算法——NA-PFCIM。该算法将项集挖掘过程看作一个概率分布函数,考虑到基于正态分布模型的方法提取的频繁项集精确度较高,而且支持大型数据库,采用了正态分布模型提取频繁项集。同时,为了减少搜索空间以及避免冗余计算,利用基于深度优先搜索的策略来获得所有的概率频繁闭项集。该算法还设计了两个剪枝策略:超集修剪和子集修剪。最后,在常用的数据集(T10I4D100K、Accidents、Mushroom、Chess)上,将提出的NA-PFCIM算法和基于泊松分布的A-PFCIM算法进行比较。实验结果表明,NA-PFCIM算法能够减少所要扩展的项集,同时减少项集频繁概率的计算,其性能优于对比算法。

关键词: 不确定数据, 频繁项集, 频繁闭项集, 剪枝策略, 正态分布

CLC Number: