Journal of Computer Applications


Review of Checkpoint Technology for Multiple Computing Scenarios

  

  • Received: 2024-05-31 Revised: 2024-08-28 Accepted: 2024-09-10 Online: 2024-09-13 Published: 2024-09-13
  • Supported by:
    Shandong Provincial Natural Science Foundation


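CHEN Xiaolin (陈筱琳), ZHANG Yaqiang (张亚强), SHI Hongzhi (史宏志)

  1. Shandong Massive Information Technology Research Institute
  • Corresponding author: CHEN Xiaolin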

Abstract: Checkpoint technology saves the current computing task and system state of a computing system so that the system can be rolled back to the saved state when needed. It is commonly used for system failure recovery, job migration, and job preemption. As technology develops, computing scenarios have become more diverse, computing scales larger, the structural hierarchy of computing systems more complex, and computing environments more variable. These changes raise the probability of failure: the mean time between failures has fallen from 6.5-40 hours to 1.25 hours. As a typical fault-tolerance method, checkpoint technology is therefore increasingly important. This article outlines the recent development of checkpoint technology for diverse computing scenarios and classifies existing techniques by their technical characteristics. It reviews the latest research progress in four directions: incremental checkpoint, multi-level asynchronous checkpoint, optimal checkpoint interval, and fault perception-based checkpoint. It then summarizes the trends toward dynamic, intelligent, and proactive checkpointing, together with the challenges the technology faces and its likely future directions. By surveying the main ideas and latest methods for optimizing checkpoint strategies, the article aims to help researchers quickly grasp the current state and future trends of checkpoint technology.

Key words: incremental checkpoint, multi-level asynchronous checkpoint, optimal checkpoint interval, dynamic checkpoint, fault perception-based checkpoint

