计算机应用 ›› 2016, Vol. 36 ›› Issue (2): 353-357.DOI: 10.11772/j.issn.1001-9081.2016.02.0353

• 第三届CCF大数据学术会议(CCF BigData 2015) • 上一篇    下一篇

Spark环境下基于多维布隆过滤器的星型连接算法

周国亮1, 萨初日拉2, 朱永利2   

  1. 1. 国网冀北电力有限公司 技能培训中心, 河北 保定 071051;
    2. 华北电力大学 控制与计算机工程学院, 河北 保定 071003
  • 收稿日期:2015-09-15 修回日期:2015-09-23 出版日期:2016-02-10 发布日期:2016-02-03
  • 通讯作者: 周国亮(1978-),男,河北保定人,副教授,博士,主要研究方向:智能电网、联机分析处理。
  • 作者简介:萨初日拉(1992-),男(蒙古族),内蒙古通辽人,硕士研究生,主要研究方向:云计算、数据挖掘;朱永利(1963-),男,河北衡水人,教授,博士生导师,博士,CCF高级会员,主要研究方向:人工智能、电力调度自动化系统。
  • 基金资助:
    中央高校基本科研业务费专项资金资助项目(13MS103);河北省自然科学基金资助项目(F2014502069)。

Star join algorithm based on multi-dimensional Bloom filter in Spark

ZHOU Guoliang1, SA Churila2, ZHU Yongli2   

  1. 1. Skill Training Center, State Grid Jibei Electric Power Company, Baoding Hebei 071051, China;
    2. School of Control and Computer Engineering, North China Electric Power University, Baoding Hebei 071003, China
  • Received:2015-09-15 Revised:2015-09-23 Online:2016-02-10 Published:2016-02-03

摘要: 为了适应联机分析处理(OLAP)系统中实时数据高性能分析需求不断提高的需求,提出一种能够适合Spark环境并结合多维Bloom Filter(MDBF)的星型连接算法SMDBFSJ。首先,根据多个维表构建MDBF,利用其占用空间小的特点,广播到所有节点;然后,在本地节点完成事实表过滤操作,事实表不需要在节点间移动数据;最后,过滤后的事实表与维表采用重划分方式进行连接,进而得到最终结果。SMDBFSJ算法避免了事实表数据移动,通过MDBF减小了需要广播的数据量,充分结合了广播连接和重划分连接的优势。实验结果表明了该算法的有效性,在单机和集群环境下,该算法相比重划分连接均获得了3倍左右的性能提升。

关键词: 布隆过滤器, 星型连接, 联机分析处理, Spark

Abstract: To meet the high performance analysis requirements for real-time data in On-Line Analytical Processing (OLAP) system, a new star join algorithm which is suitable for Spark platform was proposed based on Multi-Dimensional Bloom Filter (MDBF), namely SMDBFSJ (Spark Multi-Dimensional Bloom Filter Star Join). First of all, MDBF was built according to the dimension tables and broadcasted to all the nodes based on the feature of small size. Then the fact table was filtered completely on the local node, and there was no data movement between nodes. Finally, the filtered fact table and dimension tables were joined using repartition join model to get the final result. SMDBFSJ algorithm avoides the data moving of fact table, reduces the size of broadcasting data using MDBF, as well as fully combines the advantages of broadcast join and repartition join. The experimental results prove the validity of SMDBFSJ algorithm, in stand-alone and cluster environments. SMDBFSJ algorithm can obtain about three times of performance improvement compared with repartition join in Spark.

Key words: Bloom filter, star join, On-Line Analytical Processing(OLAP), Spark

中图分类号: