Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (11): 3069-3074.DOI: 10.11772/j.issn.1001-9081.2018041219

Previous Articles     Next Articles

Optimization and implementation of parallel FP-Growth algorithm based on Spark

GU Junhua1,2, WU Junyan2, XU Xinyun2, XIE Zhijian2, ZHANG Suqi3   

  1. 1. School of Artificial Intelligence and Data Science, Hebei University of Technology, Tianjin 300401, China;
    2. Hebei Province Key Laboratory of Big Data Computing(Hebei University of Technology), Tianjin 300401, China;
    3. School of Information Engineering, Tianjin University of Commerce, Tianjin 300134, China
  • Received:2018-04-30 Revised:2018-05-31 Online:2018-11-10 Published:2018-11-10
  • Supported by:
    This work is partially supported by the Science-Technology Program of Hebei Province (17210305D), the Science-Technology Program of Tianjin (16ZXHLSF0023), the Natural Science Foundation of Tianjin (15JCQNJC00600), the Science-Technology Program of Tianjin (15ZXHLGX00130)

基于Spark的并行FP-Growth算法优化及实现

顾军华1,2, 武君艳2, 许馨匀2, 谢志坚2, 张素琪3   

  1. 1. 河北工业大学 人工智能与数据科学学院, 天津 300401;
    2. 河北省大数据计算重点实验室(河北工业大学), 天津 300401;
    3. 天津商业大学 信息工程学院, 天津 300134
  • 通讯作者: 张素琪
  • 作者简介:顾军华(1966-),男,河北石家庄人,教授,博士,CCF会员,主要研究方向:数据挖掘、智能信息处理、信息采集与集成、智能计算与优化、功能与信息展示、软件工程、项目管理;武君艳(1994-),女,山西晋中人,硕士研究生,主要研究方向:数据挖掘、计算机仿真、机器学习;许馨匀(1995-),女,河北石家庄人,硕士研究生,主要研究方向:情感分析、自然语言处理、深度学习;谢志坚(1995-),男,河北沧州人,硕士研究生,主要研究方向:机器学习、数据挖掘;张素琪(1980-),女,河北邢台人,讲师,博士,CCF会员,主要研究方向:机器学习、数据挖掘。
  • 基金资助:
    河北省科技计划项目(17210305D);天津市科技计划项目(16ZXHLSF0023);天津市自然科学基金资助项目(15JCQNJC00600);天津市科技计划项目(15ZXHLGX00130)。

Abstract: In order to further improve the execution efficiency of Frequent Pattern-Growth (FP-Growth) algorithm on Spark platform, a new parallel FP-Growth algorithm based on Spark, named BFPG (Better Frequent Pattern-Growth), was presented. Firstly, the grouping strategy F-List was improved in the size of the Frequent Pattern-Tree (FP-Tree) and the amount of partition calculation to ensure that the load sum of each partition was approximately equal. Secondly, the data set partitioning strategy was optimized by creating a list P-List, and then the time complexity was reduced by reducing the traversal times. The experimental results show that the BFPG algorithm improves the mining efficiency of the parallel FP-Growth algorithm, and the algorithm has good scalability.

Key words: big data platform, association rule, frequent itemset, Frequent Pattern-Growth (FP-Growth) algorithm, Spark

摘要: 为了进一步提高在Spark平台上的频繁模式增长(FP-Growth)算法执行效率,提出一种新的基于Spark的并行FP-Growth算法——BFPG。首先,从频繁模式树(FP-Tree)规模大小和分区计算量对F-List分组策略进行改进,保证每个分区负载总和近似相等;然后,通过创建列表P-List对数据集划分策略进行优化,减少遍历次数,降低时间复杂度。实验结果表明,BFPG算法提高了并行FP-Growth算法挖掘效率,且算法具有良好的扩展性。

关键词: 大数据平台, 关联规则, 频繁项集, 频繁模式增长算法, Spark

CLC Number: