Helius: a lightweight big data processing system

doi:10.11772/j.issn.1001-9081.2017.02.0305

Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (2): 305-310. DOI: 10.11772/j.issn.1001-9081.2017.02.0305

• 33rd National Database Conference of China (NDBC 2016)

DING Mengsu, CHEN Shimin

State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190, China

Received: 2016-08-12 Revised: 2016-10-22 Online: 2017-02-11 Published: 2017-02-10
Corresponding author: CHEN Shimin, chensm@ict.ac.cn
About the authors: DING Mengsu (born 1993), female, from Ji'an, Jiangxi, M.S. candidate; research interests: big data processing, parallel and distributed computing. CHEN Shimin (born 1973), male, from Beijing, research professor, Ph.D.; research interests: data management systems, big data processing, computer architecture.
Supported by:
the CAS Hundred Talents Program; the General Program of the National Natural Science Foundation of China (61572468); the Innovative Research Groups Project of the National Natural Science Foundation of China (61521092).

Abstract

Spark datasets are immutable, and Spark's dependence on the Java Virtual Machine (JVM) incurs significant overhead in code execution, memory management, and data serialization/deserialization. To address these limitations, a lightweight big data processing system named Helius was designed and implemented in C/C++. Helius supports the basic operations of Spark while allowing a dataset to be modified as a whole. It uses C/C++ to optimize memory management and network communication, and adopts a stateless worker mechanism to simplify fault tolerance and recovery on the distributed computing platform. Experimental results show that, over 5 iterations, the running time of PageRank on Helius was only 25.12% to 53.14% of that on Spark, and the running time of TPC-H Q6 on Helius was only 57.37% of that on Spark. For a single PageRank iteration, the IP data received and sent by the master node under Helius were about 40% and 15%, respectively, of those under Spark, and over a 200 s run the total memory consumed in the worker node by Helius was about 25% of that consumed by Spark. The experimental results and analysis indicate that, compared with Spark, Helius saves memory, needs no serialization or deserialization, reduces network interaction, and simplifies fault tolerance.

Key words: in-memory computation, big data processing, distributed computation, Directed Acyclic Graph (DAG) scheduling, fault tolerance and recovery
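
The abstract contrasts Spark's immutable datasets with Helius's ability to modify a dataset as a whole, and uses PageRank as the iterative benchmark. The sketch below illustrates that idea only in miniature: a single-process PageRank that overwrites one rank table on every iteration instead of producing a new dataset each time. It is not Helius's API (Helius is written in C/C++ and is distributed). The function name `pagerank_in_place`, the adjacency-list input format, the damping factor of 0.85, and the handling of pages without out-links are all assumptions made for illustration.

```python
# Hypothetical sketch, not Helius code: single-process PageRank that
# updates one rank table in place on every iteration.

def pagerank_in_place(out_links, iterations=5, damping=0.85):
    """Return a dict of ranks.

    out_links maps each page to the list of pages it links to.
    Assumption: every link target also appears as a key of out_links.
    """
    pages = sorted(out_links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}   # the single long-lived dataset
    contrib = {p: 0.0 for p in pages}     # scratch buffer, reused each round

    for _ in range(iterations):
        # Reset the scratch buffer instead of allocating a new one.
        for p in pages:
            contrib[p] = 0.0

        # Each page splits its current rank evenly among its targets.
        for src in pages:
            targets = out_links[src]
            if not targets:
                continue  # assumption: pages without out-links send nothing
            share = ranks[src] / len(targets)
            for dst in targets:
                contrib[dst] += share

        # Whole-dataset update: every entry of ranks is overwritten in place.
        for p in pages:
            ranks[p] = (1.0 - damping) / n + damping * contrib[p]

    return ranks
```

Calling `pagerank_in_place({"a": ["b"], "b": ["a"]})` runs the default 5 iterations, matching the iteration count used in the reported experiments. The same `ranks` object is reused throughout, which is the property the sketch is meant to show.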

CLC number: TP311.133.1

DING Mengsu, CHEN Shimin. Helius: a lightweight big data processing system[J]. Journal of Computer Applications, 2017, 37(2): 305-310.

