计算机应用 ›› 2016, Vol. 36 ›› Issue (8): 2202-2206.DOI: 10.11772/j.issn.1001-9081.2016.08.2202

• 先进计算 • 上一篇    下一篇

基于聚类划分的高效用模式并行挖掘算法

邢淑凝1,2, 刘方爱1,2, 赵晓晖1,2   

  1. 1. 山东师范大学 信息科学与工程学院, 济南 250014;
    2. 山东省分布式计算机软件新技术重点实验室(山东师范大学), 济南 250014
  • 收稿日期:2016-01-11 修回日期:2016-02-27 出版日期:2016-08-10 发布日期:2016-08-10
  • 通讯作者: 刘方爱
  • 作者简介:邢淑凝(1992-),女,山东青岛人,硕士研究生,CCF会员,主要研究方向:数据挖掘、大数据分析;刘方爱(1962-),男,山东青岛人,教授,博士生导师,博士,主要研究方向:并行计算模型、分布式网络、数据挖掘;赵晓晖(1981-),女,河南范县人,讲师,博士研究生,主要研究方向:复杂网络、数据挖掘。
  • 基金资助:
    国家自然科学基金资助项目(90612003,61572301)。

Parallel high utility pattern mining algorithm based on cluster partition

XING Shuning1,2, LIU Fang'ai1,2, ZHAO Xiaohui1,2   

  1. 1. College of Information Science and Engineering, Shandong Normal University, Jinan Shandong 250014, China;
    2. Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology(Shandong Normal University), Jinan Shandong 250014, China
  • Received:2016-01-11 Revised:2016-02-27 Online:2016-08-10 Published:2016-08-10
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (90612003, 61572301).

摘要: 针对在大规模数据库中挖掘高效用模式产生大量基于内存的效用模式树,从而导致内存空间占用较大以及丢失一些高效用项集的问题,提出在Hadoop分布式计算平台下的基于聚类划分的高效用模式并行挖掘算法PUCP。首先,采用聚类的方法把数据库中相似的事务划分为若干数据子集;然后,把若干划分好的数据子集分配到Hadoop平台的各个节点中构造效用模式树;最后,把各个节点中相同项的条件模式基分配到同一个节点中进行挖掘,以减少各个节点交叉操作的次数。通过实验结果和理论分析表明:PUCP算法在不影响挖掘结果可靠性的前提下,与主流串行高效用模式挖掘——效用模式增长挖掘算法(UP-Growth)和现有的并行高效用模式挖掘算法PHUI-Growth相比,挖掘效率分别提高了61.2%和16.6%;并且使用了Hadoop计算平台,能有效缓解挖掘大规模数据的内存压力。

关键词: 大数据, 高效用模式挖掘, 聚类, 并行计算, Hadoop

Abstract: The exiting algorithms generate a lot of utility pattern trees based on memory when mining high utility patterns in large-scale database, leading to occupying more memory spaces and losing some high utility itemsets. Using Hadoop platform, a parallel high utility pattern mining algorithm, named PUCP, based on cluster partition was proposed. Firstly, the clustering method was introduced to divide the transaction database into several sub-datasets. Secondly, sub-datasets were allocated to each node of Hadoop to construct utility pattern tree. Finally, the conditional pattern bases of the same item which generated from utility pattern trees were allocated to the same node, reducing the crossover operation times of each node. The theoretical analysis and experimental results show that, compared with the mainstream serial high utility pattern mining algorithm named UP-Growth (Utility Pattern Growth) and parallel high utility pattern mining algorithm named HUI-Growth (Parallel mining High Utility Itemsets by pattern-Growth), the mining efficiency of PUCP is increased by 61.2% and 16.6% respectively without affecting the reliability of the mining results; and the memory pressure of large data mining can be effectively relieved by using Hadoop platform.

Key words: big data, high utility pattern mining, clustering, parallel computing, Hadoop

中图分类号: