Parallel cube computing in Spark

doi:10.11772/j.issn.1001-9081.2016.02.0348

Journal of Computer Applications, 2016, Vol. 36, Issue 2: 348-352. DOI: 10.11772/j.issn.1001-9081.2016.02.0348

SA Churila, ZHOU Guoliang, SHI Lei, WANG Liuwang, SHI Xin, ZHU Yongli

School of Control and Computer Engineering, North China Electric Power University, Baoding Hebei 071003, China

Received: 2015-09-15; Revised: 2015-09-22; Online: 2016-02-03; Published: 2016-02-10

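Corresponding author: SA Churila (born 1992), male (Mongolian ethnicity), from Tongliao, Inner Mongolia; M.S. candidate. Research interests: cloud computing, data mining.
About the authors: ZHOU Guoliang (born 1978), male, from Baoding, Hebei; associate professor, Ph.D. Research interests: smart grid, online analytical processing. SHI Lei (born 1991), male, from Hegang, Heilongjiang; M.S. candidate. Research interests: computer vision, pattern recognition. WANG Liuwang (born 1988), male, from Anqing, Anhui; Ph.D. candidate. Research interests: big data processing for power systems. SHI Xin (born 1988), male, from Handan, Hebei; M.S. candidate. Research interests: artificial intelligence. ZHU Yongli (born 1963), male, from Hengshui, Hebei; professor, doctoral supervisor, Ph.D., CCF senior member. Research interests: artificial intelligence, power dispatching automation systems.
Supported by: Natural Science Foundation of Hebei Province (F2014502069).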

Abstract

Traditional OnLine Analytical Processing (OLAP) responds poorly in real time when processing big data. To address this, data cube computation accelerated by Spark, a distributed in-memory computing framework, was investigated. To improve the parallelism of Bottom-Up Construction (BUC) and its ability to handle big data, a data cube algorithm based on BUC and Spark in-memory clusters was designed, referred to as BUCPark (BUC on Spark). Moreover, to avoid the blow-up of iteratively generated cube cells in memory, BUCPark was further improved to LBUCPark (Layered BUC on Spark), which is built on the idea of reusing and sharing memory. The experimental results show that LBUCPark outperforms both BUC and BUCPark in computing performance and can handle fast data cube computation on big data.

Key words: Spark, OnLine Analytical Processing (OLAP), data cube, Bottom-Up Construction (BUC)
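
For readers unfamiliar with the BUC recursion that the abstract builds on, the following is a minimal single-machine sketch in plain Python. It is not BUCPark or LBUCPark, whose details are not given on this page. It only illustrates the general BUC idea: aggregate the current partition, then partition on each remaining dimension and recurse, pruning partitions below a minimum support. The function name `buc`, the `min_support` parameter, the `"*"` marker, and the use of a simple count as the aggregate are assumptions made for illustration.

```python
from collections import defaultdict

ALL = "*"  # hypothetical marker for a dimension that has been aggregated away


def buc(rows, num_dims, min_support=1):
    """Illustrative BUC-style cube: return {cell: count} for every cell
    whose count is at least min_support. Each row is a tuple of dimension values."""
    cube = {}

    def recurse(partition, start_dim, cell):
        # Emit the aggregate (here: a row count) for the current cell.
        cube[tuple(cell)] = len(partition)
        # Specialise the cell on each remaining dimension in turn.
        for d in range(start_dim, num_dims):
            groups = defaultdict(list)
            for row in partition:
                groups[row[d]].append(row)
            for value, group in groups.items():
                # Prune: a partition below min_support cannot yield qualifying cells.
                if len(group) >= min_support:
                    cell[d] = value
                    recurse(group, d + 1, cell)
            cell[d] = ALL  # restore before moving to the next dimension

    if len(rows) >= min_support:
        recurse(rows, 0, [ALL] * num_dims)
    return cube


if __name__ == "__main__":
    sample = [("a", "x"), ("a", "y"), ("b", "x")]  # made-up rows for illustration
    for cell, count in sorted(buc(sample, 2).items()):
        print(cell, count)
```

Each recursive call works on a partition that is independent of its siblings, which is the property a parallel version can exploit. How the paper maps this onto Spark is not described in the abstract.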

CLC Number: TP393.027

SA Churila, ZHOU Guoliang, SHI Lei, WANG Liuwang, SHI Xin, ZHU Yongli. Parallel cube computing in Spark[J]. Journal of Computer Applications, 2016, 36(2): 348-352.
