Journal of Computer Applications, 2017, Vol. 37, Issue (2): 305-310. DOI: 10.11772/j.issn.1001-9081.2017.02.0305

• The 33rd National Database Conference of China (NDBC 2016) •

Helius: a lightweight big data processing system

DING Mengsu, CHEN Shimin

  1. Key Laboratory of Computer System and Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190, China
  • Received: 2016-08-12 Revised: 2016-10-22 Online: 2017-02-10 Published: 2017-02-11
  • Corresponding author: CHEN Shimin, chensm@ict.ac.cn
  • About the authors: DING Mengsu (1993-), female, born in Ji'an, Jiangxi; M.S. candidate; research interests: big data processing, parallel and distributed computing. CHEN Shimin (1973-), male, born in Beijing; professor, Ph.D.; research interests: data management systems, big data processing, computer architecture.
  • Supported by:

    This work is partially supported by the CAS Hundred Talents Program, the General Program of the National Natural Science Foundation of China (61572468), and the Innovative Research Groups Program of the National Natural Science Foundation of China (61521092).

Abstract:

Spark has two notable limitations: its datasets are immutable, and its dependence on the Java Virtual Machine (JVM) adds significant overhead to code execution, memory management, and data serialization/deserialization. To address them, a lightweight big data processing system named Helius was designed and implemented in C/C++. Helius supports the basic operations of Spark while allowing a dataset to be modified as a whole. It uses C/C++ to optimize memory management and network communication, and adopts a stateless worker mechanism to simplify fault tolerance and recovery in the distributed computing platform. Experimental results showed that, over 5 iterations, the running time of PageRank on Helius was only 25.12% to 53.14% of that on Spark, and the running time of TPCH Q6 on Helius was only 57.37% of that on Spark. For one iteration of PageRank, the IP incoming and outgoing data sizes of the master node in Helius were about 40% and 15% of those in Spark, and over a 200 s run the total memory consumed on the worker node in Helius was about 25% of that in Spark. The results and analysis indicate that, compared with Spark, Helius saves memory, eliminates the need for serialization and deserialization, reduces network interaction, and simplifies fault tolerance.

Key words: in-memory computation, big data processing, distributed computation, Directed Acyclic Graph (DAG) scheduling, fault tolerance and recovery
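
To make the idea of modifying a dataset as a whole concrete, the following is a minimal single-machine sketch of an iterative PageRank job that overwrites its rank vector in place on every iteration. It is not Helius code and does not use any Helius or Spark API: the names `Edge` and `pagerank_in_place`, the damping factor of 0.85, and the edge-list representation are all assumptions made for illustration. Dangling vertices are not given special handling.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical illustration only: these names are not part of Helius or Spark.
// Each edge is (source, destination); vertices are numbered 0..n-1.
using Edge = std::pair<std::size_t, std::size_t>;

// Runs `iterations` PageRank steps and overwrites `ranks` in place,
// rather than materializing a new immutable dataset per iteration.
// Assumption: the damping factor defaults to the commonly used 0.85.
void pagerank_in_place(const std::vector<Edge>& edges,
                       std::vector<double>& ranks,
                       int iterations,
                       double damping = 0.85) {
    const std::size_t n = ranks.size();
    if (n == 0) return;

    // Out-degree of every vertex, computed once before iterating.
    std::vector<std::size_t> out_degree(n, 0);
    for (const Edge& e : edges) {
        assert(e.first < n && e.second < n);  // edges must reference valid vertices
        ++out_degree[e.first];
    }

    std::vector<double> contrib(n);  // scratch buffer reused across iterations
    for (int it = 0; it < iterations; ++it) {
        contrib.assign(n, 0.0);
        // Each vertex splits its current rank evenly among its out-neighbors.
        for (const Edge& e : edges) {
            contrib[e.second] +=
                ranks[e.first] / static_cast<double>(out_degree[e.first]);
        }
        // Whole-dataset update: every rank is rewritten in the same buffer.
        for (std::size_t v = 0; v < n; ++v) {
            ranks[v] = (1.0 - damping) / static_cast<double>(n)
                     + damping * contrib[v];
        }
    }
}
```

Calling `pagerank_in_place(edges, ranks, 5)` corresponds to the 5-iteration setting mentioned in the abstract; the caller's `ranks` vector holds the result afterwards, with no per-iteration copy of the dataset.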

CLC Number: