Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (6): 1638-1647. DOI: 10.11772/j.issn.1001-9081.2019101793

• Data Science and Technology •

PiFlow, a model-driven big data pipeline framework

ZHU Xiaojie1, ZHAO Zihao1,2, DU Yi1,2

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190
  2. University of Chinese Academy of Sciences, Beijing 100049
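  • Received: 2019-10-22 Revised: 2020-01-13 Online: 2020-06-10 Published: 2020-06-18
  • Corresponding author: DU Yi (born 1988)
  • About the authors: ZHU Xiaojie (born 1985), female, from Tianjin, M. S., engineer. Her research interests include big data processing and big data pipelines. ZHAO Zihao (born 1994), male, from Fuxin, Liaoning, M. S. candidate, CCF member. His research interests include big data processing, graph data systems, and artificial intelligence databases. DU Yi (born 1988), male, from Liaocheng, Shandong, Ph. D., associate research fellow, CCF member. His research interests include visual analysis and data mining.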
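  • Funding:
    Cloud Computing and Big Data Key Special Project of the National Key Research and Development Program of China (2018YFB1004001); Key Program of the National Natural Science Foundation of China (61836013); Science and Technology Major Project of China National Tobacco Corporation (110201801019(SJ-01)).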

PiFlow: model-driven big data pipeline framework

ZHU Xiaojie1, ZHAO Zihao1,2, DU Yi1,2   

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2019-10-22 Revised: 2020-01-13 Online: 2020-06-10 Published: 2020-06-18
  • Contact: DU Yi, born in 1988, Ph. D., associate research fellow. His research interests include visual analysis and data mining.
  • About author: ZHU Xiaojie, born in 1985, M. S., engineer. Her research interests include big data processing and big data pipelines. ZHAO Zihao, born in 1994, M. S. candidate. His research interests include big data processing, graph data systems, and artificial intelligence databases. DU Yi, born in 1988, Ph. D., associate research fellow. His research interests include visual analysis and data mining.
  • Supported by:
    Cloud Computing and Big Data Key Program of the National Key Research and Development Plan of China (2018YFB1004001), the Key Project of National Natural Science Foundation of China (61836013), the Science and Technology Major Project of China National Tobacco Corporation (110201801019(SJ-01)).

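Abstract: Big data processing with complex workflows mostly relies on pipeline systems, but existing pipeline systems for big data processing fall short in usability, function reusability, extensibility, and processing performance. To address these problems, improve the efficiency of building and developing big data processing environments, and optimize the processing flow, a model-driven big data pipeline framework named PiFlow is proposed. First, the big data processing procedure is abstracted as a directed acyclic graph. Then, a series of components is developed for building data processing pipelines, and a pipeline task execution mechanism is designed. In addition, to standardize and simplify the description of the pipeline framework, a model-driven big data pipeline description language, PiFlowDL, is designed; it describes big data processing tasks in a modular and hierarchical manner. PiFlow configures pipelines in a What You See Is What You Get (WYSIWYG) manner, integrates functions such as status monitoring, template configuration, and component integration, and achieves a 2- to 7-fold performance improvement over Apache NiFi.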

Keywords: big data, pipeline, pipeline scheduling, model-driven development method, data processing

Abstract: Big data processing with complex workflows mostly relies on pipeline systems. However, pipeline systems for big data processing have shortcomings in usability, function reusability, extensibility, and processing performance. To solve these problems, improve the efficiency of building and developing big data processing environments, and optimize the processing flow, a model-driven big data pipeline framework called PiFlow was proposed. Firstly, the big data processing procedure was abstracted as a directed acyclic graph. Then, a series of components was developed to construct data processing pipelines, and a pipeline task execution mechanism was designed. In addition, to standardize and simplify the description of the pipeline framework, a model-driven big data pipeline description language called PiFlowDL was designed, which describes big data processing tasks in a modular and hierarchical way. PiFlow configures pipelines in a What You See Is What You Get (WYSIWYG) way and integrates functions such as status monitoring, template configuration, and component integration. Compared with Apache NiFi, it achieves a 2- to 7-fold performance improvement.

Key words: big data, pipeline, pipeline scheduling, model driven development method, data processing
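
The abstraction named in the abstract, a processing job modeled as a directed acyclic graph whose nodes run once their upstream nodes have finished, can be illustrated with a minimal sketch. The class `Pipeline`, its methods `add_step` and `run`, and the step names are hypothetical and chosen only for illustration; this is not PiFlow's API or PiFlowDL, and it assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter


class Pipeline:
    """Hypothetical DAG pipeline for illustration; not PiFlow's actual API."""

    def __init__(self):
        self.steps = {}     # step name -> callable taking a list of upstream outputs
        self.upstream = {}  # step name -> list of upstream step names

    def add_step(self, name, func, after=()):
        """Register a step and the steps whose outputs it consumes."""
        self.steps[name] = func
        self.upstream[name] = list(after)
        return self

    def run(self):
        """Execute every step after all of its upstream steps.

        TopologicalSorter raises graphlib.CycleError if the graph is not acyclic.
        """
        results = {}
        for name in TopologicalSorter(self.upstream).static_order():
            inputs = [results[u] for u in self.upstream[name]]
            results[name] = self.steps[name](inputs)
        return results


# Usage: one source feeding two branches that are merged again.
pipeline = (
    Pipeline()
    .add_step("source", lambda _: [3, 1, 2])
    .add_step("sort", lambda ins: sorted(ins[0]), after=["source"])
    .add_step("double", lambda ins: [x * 2 for x in ins[0]], after=["source"])
    .add_step("merge", lambda ins: ins[0] + ins[1], after=["sort", "double"])
)
results = pipeline.run()
print(results["merge"])
```

The sketch runs steps sequentially in one process; a real pipeline framework would typically dispatch independent branches (here `sort` and `double`) to a distributed engine and track their status.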

CLC Number: