Parallel cube computing in Spark

doi:10.11772/j.issn.1001-9081.2016.02.0348

Journal of Computer Applications, 2016, Vol. 36, Issue (2): 348-352. DOI: 10.11772/j.issn.1001-9081.2016.02.0348

• The 3rd CCF Big Data Conference (CCF BigData 2015) •

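SA Churila, ZHOU Guoliang, SHI Lei, WANG Liuwang, SHI Xin, ZHU Yongli

School of Control and Computer Engineering, North China Electric Power University, Baoding Hebei 071003, China

Received: 2015-09-15 Revised: 2015-09-22 Online: 2016-02-03 Published: 2016-02-10
Corresponding author: SA Churila (born 1992), male (Mongolian), from Tongliao, Inner Mongolia, M.S. candidate; research interests: cloud computing, data mining.
About the authors: ZHOU Guoliang (born 1978), male, from Baoding, Hebei, associate professor, Ph.D.; research interests: smart grid, online analytical processing. SHI Lei (born 1991), male, from Hegang, Heilongjiang, M.S. candidate; research interests: computer vision, pattern recognition. WANG Liuwang (born 1988), male, from Anqing, Anhui, Ph.D. candidate; research interest: big data processing for power systems. SHI Xin (born 1988), male, from Handan, Hebei, M.S. candidate; research interest: artificial intelligence. ZHU Yongli (born 1963), male, from Hengshui, Hebei, professor, doctoral supervisor, Ph.D., CCF senior member; research interests: artificial intelligence, power dispatching automation systems.
Supported by:
Natural Science Foundation of Hebei Province (F2014502069).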

Abstract

Traditional OnLine Analytical Processing (OLAP) responds poorly in real time when processing big data. To address this problem, a data cube computation method accelerated by Spark, a distributed in-memory computing framework, was investigated. To improve the parallelism of Bottom-Up Construction (BUC) and its adaptability to big data, a BUC algorithm running on a Spark in-memory cluster was designed, referred to as BUCPark (BUC on Spark). Moreover, to avoid the expansion of iteratively generated cube cells in memory, BUCPark was further improved into LBUCPark (Layered BUC on Spark), which is based on the idea of reusing and sharing memory. The experimental results show that LBUCPark outperforms both BUC and BUCPark in computing performance, and that it is capable of fast data cube computation in big data settings.

Key words: Spark, OnLine Analytical Processing (OLAP), data cube, Bottom-Up Construction (BUC)
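
For readers unfamiliar with BUC, the following is a minimal single-machine sketch in Python of the general bottom-up construction idea that BUCPark and LBUCPark build on. It is not the authors' Spark implementation: the function name `buc`, the choice of COUNT as the measure, the `min_support` pruning threshold, and the sample rows are all illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Tuple[str, ...]   # one fact-table row: one value per dimension
Cell = Tuple[str, ...]  # one cube cell: a value or "*" (ALL) per dimension


def buc(rows: Sequence[Row], num_dims: int, min_support: int = 1) -> Dict[Cell, int]:
    """Return COUNT for every cube cell whose count is at least min_support."""
    cube: Dict[Cell, int] = {}

    def recurse(partition: Sequence[Row], start_dim: int, cell: List[str]) -> None:
        # Emit the aggregate for the current cell (all rows in this partition).
        cube[tuple(cell)] = len(partition)
        # Refine the cell by each remaining dimension, left to right.
        for d in range(start_dim, num_dims):
            groups: Dict[str, List[Row]] = defaultdict(list)
            for row in partition:
                groups[row[d]].append(row)
            for value, group in groups.items():
                # Pruning: a partition below the threshold cannot contain
                # any more specific cell that reaches the threshold.
                if len(group) >= min_support:
                    cell[d] = value
                    recurse(group, d + 1, cell)
            cell[d] = "*"  # restore ALL before moving to the next dimension

    if len(rows) >= min_support:
        recurse(rows, 0, ["*"] * num_dims)
    return cube


# Hypothetical two-dimensional fact table.
sample_rows = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
full_cube = buc(sample_rows, num_dims=2)
iceberg_cube = buc(sample_rows, num_dims=2, min_support=2)
```

With `min_support=1` the sketch materializes every non-empty cell. Raising the threshold skips sparse partitions early, which is the property that makes a bottom-up traversal attractive for large inputs.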

CLC number: TP393.027

SA Churila, ZHOU Guoliang, SHI Lei, WANG Liuwang, SHI Xin, ZHU Yongli. Parallel cube computing in Spark[J]. Journal of Computer Applications, 2016, 36(2): 348-352.

