Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (12): 3280-3284. DOI: 10.11772/j.issn.1001-9081.2016.12.3280

• Advanced Computing •

Data analysis method for parallel DHP based on Hadoop

YANG Yanxia1, FENG Lin1,2

  1. College of Computer Science, Sichuan Normal University, Chengdu Sichuan 610101, China;
    2. Science and Technology Park Development Company Limited of Sichuan Normal University, Chengdu Sichuan 610066, China
  • Received: 2016-05-31  Revised: 2016-08-03  Online: 2016-12-10  Published: 2016-12-08
  • Corresponding author: FENG Lin
  • About the authors: YANG Yanxia (1989-), female, born in Xichang, Sichuan, M. S. candidate, her research interests include data mining and software engineering; FENG Lin (1972-), male, born in Bazhong, Sichuan, Ph. D., professor, CCF member, his research interests include rough sets, granular computing and data mining.
  • Supported by:
    This work is partially supported by the National Science and Technology Support Program of China (2014BAH11F01, 2014BAH11F02) and the Science and Technology Support Program of Sichuan Province (15GZ0079).


Abstract: Generating the frequent 2-itemset L2 from the candidate set C2 is a bottleneck of the Apriori algorithm for mining association rules. The Direct Hashing and Pruning (DHP) algorithm uses a generated hash table H2 to prune useless candidate itemsets from C2 and thus improve the efficiency of generating L2. However, the traditional DHP algorithm is serial and cannot handle large-scale data effectively. To address this problem, a parallel DHP algorithm named H_DHP was proposed. First, the feasibility of the parallelization strategy for DHP was analyzed and proved theoretically. Then, the generation of the hash table H2 and of the frequent itemsets L1 and L3-Lk was implemented in parallel on the Hadoop platform, and the association rules were generated with the HBase database. The simulation results show that, compared with the traditional DHP algorithm, H_DHP performs better in terms of data processing time, the scale of data sets it can handle, speedup, and scalability.
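
To make the decomposition described above concrete, the following Python sketch illustrates, purely for exposition, how the DHP hash-pruning step splits into independent "map" passes over transaction chunks and a "reduce" step that sums partial counts, which is what makes it amenable to Hadoop MapReduce. It is not the authors' implementation (the paper uses Hadoop and HBase, which are not shown here); the function names map_chunk, reduce_counts and prune_c2, and the parameters NUM_BUCKETS and MIN_SUP, are hypothetical.

from collections import Counter
from itertools import combinations

NUM_BUCKETS = 8   # size of the hash table H2 (illustrative value)
MIN_SUP = 2       # absolute minimum support threshold (illustrative value)

def bucket(pair):
    # Deterministic toy hash mapping a 2-item pair to one of the H2 buckets.
    return sum(ord(ch) for item in sorted(pair) for ch in item) % NUM_BUCKETS

def map_chunk(transactions):
    # "Map" step: one worker scans its chunk independently and returns
    # partial 1-item counts (toward L1) and partial H2 bucket counts.
    item_counts, bucket_counts = Counter(), Counter()
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket(pair)] += 1
    return item_counts, bucket_counts

def reduce_counts(partials):
    # "Reduce" step: sum the partial counts produced by all chunks.
    total_items, total_buckets = Counter(), Counter()
    for item_counts, bucket_counts in partials:
        total_items.update(item_counts)
        total_buckets.update(bucket_counts)
    return total_items, total_buckets

def prune_c2(total_items, total_buckets):
    # Build C2 from L1 and keep only pairs whose H2 bucket reaches MIN_SUP,
    # which is the DHP pruning idea described in the abstract.
    l1 = sorted(i for i, c in total_items.items() if c >= MIN_SUP)
    return [pair for pair in combinations(l1, 2)
            if total_buckets[bucket(pair)] >= MIN_SUP]

if __name__ == "__main__":
    # Two chunks stand in for the input splits Hadoop would hand to mappers.
    chunks = [
        [["A", "B", "C"], ["A", "B"], ["B", "C", "D"]],
        [["A", "C"], ["B", "C"], ["A", "B", "C", "D"]],
    ]
    partials = [map_chunk(c) for c in chunks]   # independent, hence parallelizable
    items, buckets = reduce_counts(partials)
    print("candidate 2-itemsets kept after DHP pruning:", prune_c2(items, buckets))

Because the per-chunk counts are computed independently and only summed afterwards, the same scheme maps naturally onto MapReduce jobs over HDFS splits; in the paper's setting the resulting itemsets would then be stored in HBase, from which the association rules are derived.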

Key words: Hadoop, Hash table, Apriori algorithm, Direct Hashing and Pruning (DHP) algorithm


CLC number: