Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (6): 1922-1933. DOI: 10.11772/j.issn.1001-9081.2024050697

• Advanced computing •

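Review of checkpoint technology for multiple computing scenarios

Xiaolin CHEN, Yaqiang ZHANG, Hongzhi SHI

  1. Shandong Massive Information Technology Research Institute, Jinan, Shandong 250101, China
  • Received: 2024-05-31 Revised: 2024-08-28 Accepted: 2024-09-10 Online: 2024-09-13 Published: 2025-06-10
  • Corresponding author: Xiaolin CHEN
  • About the authors: CHEN Xiaolin (born 1995), female, from Tai'an, Shandong, Ph.D., CCF member. Main research interests: fault-tolerance theory, human-computer interaction in virtual reality. cxl95@163.com
    ZHANG Yaqiang (born 1990), male, from Taiyuan, Shanxi, senior engineer, Ph.D., CCF member. Main research interests: edge computing, fault-tolerant computing, heterogeneous computing.
    SHI Hongzhi (born 1988), male, from Tangshan, Hebei, M.S., CCF member. Main research interests: computer architecture, fault-tolerant computing, deep learning.
  • Funding:
    Shandong Provincial Natural Science Foundation (ZR2021QF104)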

Review of checkpoint technology for multiple computing scenarios

Xiaolin CHEN, Yaqiang ZHANG, Hongzhi SHI

  1. Shandong Massive Information Technology Research Institute, Jinan, Shandong 250101, China
  • Received: 2024-05-31 Revised: 2024-08-28 Accepted: 2024-09-10 Online: 2024-09-13 Published: 2025-06-10
  • Contact: Xiaolin CHEN
  • About the authors: CHEN Xiaolin, born in 1995, Ph.D. Her research interests include fault-tolerance theory and human-computer interaction in virtual reality.
    ZHANG Yaqiang, born in 1990, Ph.D., senior engineer. His research interests include edge computing, fault-tolerant computing, and heterogeneous computing.
    SHI Hongzhi, born in 1988, M.S. His research interests include computer architecture, fault-tolerant computing, and deep learning.
  • Supported by:
    Shandong Provincial Natural Science Foundation (ZR2021QF104)


Abstract:

Checkpoint technology saves the state of the current computing task and of the system so that the system can be rolled back to that saved state when needed. It is widely used in scenarios such as system failure recovery, job migration, and job preemption. As technology develops, computing scenarios become more diverse, computing scales grow, the structural hierarchy of computing systems becomes more complex, and computing environments become more variable, all of which raise the probability of failure. At the same time, the Mean Time Between Failures (MTBF) has shrunk from the range [6.50 h, 40.00 h] to 1.25 h. Checkpoint technology, a typical fault-tolerance method, is therefore becoming increasingly important. This review first outlines the recent development of checkpoint technology for multiple computing scenarios and classifies existing techniques by their technical characteristics. It then reviews the latest research progress in four directions: incremental checkpoint, multi-level asynchronous checkpoint, optimal checkpoint interval, and fault perception-based checkpoint. It also summarizes the development trends of checkpoint technology in these scenarios (dynamic, intelligent, and proactive) and the challenges the technology faces. Finally, it sorts out the main ideas and latest methods for optimizing checkpoint strategies, to help researchers quickly grasp the current state and future trends of checkpoint technology.

Key words: incremental checkpoint, multi-level asynchronous checkpoint, optimal checkpoint interval, dynamic checkpoint, fault perception-based checkpoint
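
To make the link between a shrinking Mean Time Between Failures and the optimal checkpoint interval concrete, the following minimal sketch applies Young's first-order approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF). This is a standard textbook approximation, not a method taken from this review. The function name `young_interval` and the 0.05 h checkpoint cost are assumptions for illustration only; the MTBF values are the ones quoted in the abstract.

```python
import math


def young_interval(checkpoint_cost_h: float, mtbf_h: float) -> float:
    """Young's first-order estimate of the optimal checkpoint interval.

    Both arguments and the result are in hours. The approximation is
    reasonable only when the checkpoint cost is much smaller than the MTBF.
    """
    if checkpoint_cost_h <= 0 or mtbf_h <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_h * mtbf_h)


# Assumed value, NOT from the article: one checkpoint takes 3 minutes.
CHECKPOINT_COST_H = 0.05

# MTBF values quoted in the abstract: the old range bounds and the new figure.
for mtbf_h in (40.00, 6.50, 1.25):
    interval_h = young_interval(CHECKPOINT_COST_H, mtbf_h)
    print(f"MTBF {mtbf_h:5.2f} h -> checkpoint roughly every {interval_h:.2f} h")
```

Under the assumed checkpoint cost, the suggested interval drops from about 2 h at an MTBF of 40.00 h to about 0.35 h at an MTBF of 1.25 h, so checkpoints must be taken far more often and their overhead matters correspondingly more.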

CLC number: