Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (11): 3337-3345. DOI: 10.11772/j.issn.1001-9081.2021122108

• CCF Bigdata 2021

Efficient failure recovery method for stream data processing system

Yang LIU1,2,3, Yangyang ZHANG1,2, Haoyi ZHOU1,2,4

  1. Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing 100191, China
    2. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
    3. ShenYuan Honors College, Beihang University, Beijing 100191, China
    4. College of Software, Beihang University, Beijing 100191, China
  • Received:2021-12-15 Revised:2022-02-27 Accepted:2022-03-04 Online:2022-04-18 Published:2022-11-10
  • Contact: Haoyi ZHOU
  • About author: LIU Yang, born in 1999, Ph. D. candidate. His research interests include distributed systems and graph processing systems.
    ZHANG Yangyang, born in 1991, Ph. D. candidate. His research interests include distributed systems, machine learning, and graph processing.
    ZHOU Haoyi, born in 1991, Ph. D., lecturer. His research interests include big data systems and machine learning.
  • Supported by:
    National Natural Science Foundation of China (U20B2053); Open Project of State Key Laboratory of Software Development Environment (SKLSDE-2020ZX-12)

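  • Author details: LIU Yang (born 1999), male, a native of Datong, Shanxi, Ph. D. candidate, CCF member. Main research interests: distributed systems, graph computing systems.
    ZHANG Yangyang (born 1991), male, a native of Baoding, Hebei, Ph. D. candidate, CCF member. Main research interests: distributed systems, machine learning, graph computing.
    ZHOU Haoyi (born 1991), male, a native of Deyang, Sichuan, lecturer, Ph. D., CCF member. Main research interests: big data systems, machine learning. E-mail: haoyi@buaa.edu.cn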

Abstract:

To address the problem that the stream data processing system Flink cannot handle single points of failure efficiently, a new fault-tolerant system based on incremental state and backup, Flink+, was proposed. Firstly, backup operators and data paths were established in advance. Secondly, the output data in the dataflow graph was cached, with disks used when necessary. Thirdly, task state synchronization was performed during system snapshots. Finally, when a failure occurred, the backup tasks and cached data were used to resume computation. In the system experiments, Flink+ did not significantly increase fault tolerance overhead during failure-free operation; when handling a single point of failure, compared with Flink, the proposed system reduced failure recovery time by 96.98% in a single-machine environment with a task parallelism of 8 and by 88.75% in a distributed environment with a task parallelism of 16. Experimental results show that combining incremental state with backup can effectively reduce the recovery time for a single point of failure in a stream processing system and enhance the robustness of the system.

Key words: stream data processing system, failure recovery, distributed checkpoint, state backup, Apache Flink
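
The four steps named in the abstract (pre-built backup task, cached upstream output with optional disk spill, state synchronization at snapshot time, and recovery from backup plus cache) can be illustrated with a small single-process sketch. This is not the Flink+ implementation and it uses no Flink API: the names `OutputCache`, `CountTask`, and `run_demo`, the per-key counting operator, and the way changed keys are tracked are all assumptions made only for illustration.

```python
import os
import pickle
import tempfile


class OutputCache:
    """Hypothetical cache of records an upstream task emitted since the last snapshot."""

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self.memory = []
        self.spill_file = None  # created lazily, only if memory fills up

    def append(self, record):
        if self.spill_file is None and len(self.memory) < self.max_in_memory:
            self.memory.append(record)
            return
        # Memory budget exhausted: later records go to disk, preserving order.
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        self.spill_file.seek(0, os.SEEK_END)
        pickle.dump(record, self.spill_file)

    def replay(self):
        """Return all cached records in emission order (memory first, then disk)."""
        records = list(self.memory)
        if self.spill_file is not None:
            self.spill_file.seek(0)
            while True:
                try:
                    records.append(pickle.load(self.spill_file))
                except EOFError:
                    break
        return records

    def truncate(self):
        """Drop everything; called once a snapshot has been synchronized."""
        self.memory.clear()
        if self.spill_file is not None:
            self.spill_file.close()
            self.spill_file = None


class CountTask:
    """Hypothetical stateful operator: counts occurrences per key."""

    def __init__(self):
        self.state = {}
        self.dirty = set()  # keys changed since the last snapshot

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1
        self.dirty.add(key)

    def take_delta(self):
        """Incremental state: only the entries changed since the last snapshot."""
        delta = {k: self.state[k] for k in self.dirty}
        self.dirty.clear()
        return delta

    def apply_delta(self, delta):
        self.state.update(delta)


def run_demo():
    cache = OutputCache(max_in_memory=2)
    primary, backup = CountTask(), CountTask()  # backup exists before any failure

    def emit(key):
        cache.append(key)     # upstream keeps a copy of what it sent
        primary.process(key)

    for key in ["a", "b", "a"]:
        emit(key)

    # Snapshot: ship only the changed state to the backup, then drop the cache.
    backup.apply_delta(primary.take_delta())
    cache.truncate()

    for key in ["b", "c", "a", "c"]:
        emit(key)
    expected = dict(primary.state)

    # Primary fails here: the backup replays cached records on top of synced state.
    for key in cache.replay():
        backup.process(key)
    return expected, backup.state
```

In this sketch the backup only has to replay the records emitted after the most recent snapshot, instead of restoring a full checkpoint and restarting the job, which is the intuition behind the shorter recovery time reported in the abstract.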

