Review of checkpoint technology for multiple computing scenarios

doi:10.11772/j.issn.1001-9081.2024050697

Journal of Computer Applications, 2025, Vol. 45, Issue (6): 1922-1933. DOI: 10.11772/j.issn.1001-9081.2024050697

• Advanced computing •

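Xiaolin CHEN, Yaqiang ZHANG, Hongzhi SHI

Shandong Massive Information Technology Research Institute, Jinan 250101

Received: 2024-05-31 Revised: 2024-08-28 Accepted: 2024-09-10 Online: 2024-09-13 Published: 2025-06-10
Corresponding author: Xiaolin CHEN
About the authors: CHEN Xiaolin (born 1995), female, native of Tai'an, Shandong, Ph.D., CCF member. Research interests: fault-tolerance theory, human-computer interaction in virtual reality. cxl95@163.com
ZHANG Yaqiang (born 1990), male, native of Taiyuan, Shanxi, senior engineer, Ph.D., CCF member. Research interests: edge computing, fault-tolerant computing, heterogeneous computing.
SHI Hongzhi (born 1988), male, native of Tangshan, Hebei, M.S., CCF member. Research interests: computer architecture, fault-tolerant computing, deep learning.
Funding:
Shandong Provincial Natural Science Foundation (ZR2021QF104)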

Review of checkpoint technology for multiple computing scenarios

Xiaolin CHEN, Yaqiang ZHANG, Hongzhi SHI

Shandong Massive Information Technology Research Institute, Jinan, Shandong 250101, China

Received: 2024-05-31 Revised: 2024-08-28 Accepted: 2024-09-10 Online: 2024-09-13 Published: 2025-06-10
Contact: Xiaolin CHEN
About author: CHEN Xiaolin, born in 1995, Ph.D. Her research interests include fault-tolerance theory, human-computer interaction in virtual reality.
ZHANG Yaqiang, born in 1990, Ph.D., senior engineer. His research interests include edge computing, fault-tolerant computing, heterogeneous computing.
SHI Hongzhi, born in 1988, M.S. His research interests include computer architecture, fault-tolerant computing, deep learning.
Supported by:
Shandong Provincial Natural Science Foundation (ZR2021QF104)


Abstract:

Checkpoint technology saves the current computing task and system state of a computing system so that the system can be rolled back to the saved state when needed. It is widely used in scenarios such as system failure recovery, job migration, and job preemption. As technology develops, computing scenarios become more diverse, computing scales grow, the structural hierarchy of computing systems becomes more complex, and computing environments become more variable, all of which increase the probability of failure. At the same time, the Mean Time Between Failures (MTBF) has dropped from [6.50 h, 40.00 h] to 1.25 h. Checkpoint technology, as a typical fault-tolerance method, is therefore becoming increasingly important. Firstly, the recent development of checkpoint technology for multiple computing scenarios was introduced, and existing checkpoint technologies were classified based on their technical characteristics. Then, the latest research progress was reviewed in four directions: incremental checkpoint, multi-level asynchronous checkpoint, optimal checkpoint interval, and fault perception-based checkpoint. The development trends of checkpoint technology for multiple computing scenarios (dynamic, intelligent, and proactive) and the challenges it faces were also summarized. Finally, the main ideas and latest methods for optimizing checkpoint strategies were sorted out to help researchers quickly grasp the current status and future trends of checkpoint technology.

Key words: incremental checkpoint, multi-level asynchronous checkpoint, optimal checkpoint interval, dynamic checkpoint, fault perception-based checkpoint
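
As a rough illustration of the "optimal checkpoint interval" direction named in the abstract, the sketch below applies Young's first-order approximation (reference 40), T_opt ≈ sqrt(2 · C · MTBF), to the MTBF values quoted above. The 5-minute checkpoint cost C and the function name `young_interval` are assumptions made for illustration only; the approximation itself presumes that C is much smaller than the MTBF.

```python
import math


def young_interval(checkpoint_cost_h: float, mtbf_h: float) -> float:
    """Young's first-order estimate of the compute time between checkpoints (hours)."""
    if checkpoint_cost_h <= 0 or mtbf_h <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_h * mtbf_h)


# Assumption for illustration: writing one checkpoint takes 5 minutes.
CHECKPOINT_COST_H = 5.0 / 60.0

# MTBF values quoted in the abstract: the [6.50 h, 40.00 h] range and 1.25 h.
for mtbf_h in (40.00, 6.50, 1.25):
    interval_h = young_interval(CHECKPOINT_COST_H, mtbf_h)
    print(f"MTBF {mtbf_h:5.2f} h -> checkpoint about every {interval_h * 60:.1f} min")
```

Under this assumed cost, the estimated interval shrinks with the square root of the MTBF, which is one way to see why shorter failure intervals make checkpoint overhead a more pressing concern.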

CLC number: TP302.8

Xiaolin CHEN, Yaqiang ZHANG, Hongzhi SHI. Review of checkpoint technology for multiple computing scenarios[J]. Journal of Computer Applications, 2025, 45(6): 1922-1933.

References (48)

[1] JI Z, WANG C L. Compiler-directed incremental checkpointing for low latency GPU preemption[C]// Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium. Piscataway: IEEE, 2022: 751-761.
[2] RODRÍGUEZ-PASCUAL M, CAO J, MORÍÑIGO J A, et al. Job migration in HPC clusters by means of checkpoint/restart[J]. The Journal of Supercomputing, 2019, 75(10): 6517-6541.
[3] CHEN Y Y, WANG X N, YAN X T, et al. Study on high performance computing container checkpoint technology based on CRIU[J]. Computer Science, 2024, 51(9): 40-50. (in Chinese)
[4] WANG F, WEI G Y, LIU Q, et al. Boost neural networks by checkpoints[C]// Proceedings of the 35th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2021: 19719-19729.
[5] ROJAS E, PÉREZ D, CALHOUN J C, et al. Understanding soft error sensitivity of deep learning models and frameworks through checkpoint alteration[C]// Proceedings of the 2021 IEEE International Conference on Cluster Computing. Piscataway: IEEE, 2021: 492-503.
[6] ASSOGBA K, NICOLAE B, VAN DAM H, et al. Asynchronous multi-level checkpointing: an enabler of reproducibility using checkpoint history analytics[C]// Proceedings of the SC'23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis. New York: ACM, 2023: 1748-1756.
[7] YAN X T, WANG X N, DONG S, et al. Review of the development and application of checkpointing technology in high-performance computing[J]. Computer Science, 2024, 51(9): 1-14. (in Chinese)
[8] FENG Y Y, WANG Q, XIE M H, et al. From BERT to ChatGPT: challenges and technical development of storage systems for large model training[J]. Journal of Computer Research and Development, 2024, 61(4): 809-823. (in Chinese)
[9] WANG C, MUELLER F, ENGELMANN C, et al. Proactive process-level live migration in HPC environments[C]// Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway: IEEE, 2008: 1-12.
[10] TAN N, LUETTGAU J, MARQUEZ J, et al. Scalable incremental checkpointing using GPU-accelerated de-duplication[C]// Proceedings of the 52nd International Conference on Parallel Processing. New York: ACM, 2023: 665-674.
[11] GOSSMAN M J, NICOLAE B, CALHOUN J C. Modeling multi-threaded aggregated I/O for asynchronous checkpointing on HPC systems[C]// Proceedings of the 22nd International Symposium on Parallel and Distributed Computing. Piscataway: IEEE, 2023: 101-105.
[12] YAOTHANEE J, CHANCHIO K. A pipelined multi-level checkpoint storage system for virtual cluster checkpointing[C]// Proceedings of the 8th International Conference on Cloud Computing and Big Data Analytics. Piscataway: IEEE, 2023: 239-246.
[13] MAROTTA R, MONTESANO F, PELLEGRINI A, et al. Incremental checkpointing of large state simulation models with write-intensive events via memory update correlation on buddy pages[C]// Proceedings of the IEEE/ACM 27th International Symposium on Distributed Simulation and Real Time Applications. Piscataway: IEEE, 2023: 40-47.
[14] ZHANG L, WANG Z, KONG F. Optimal checkpointing strategy for real-time systems with both logical and timing correctness[J]. ACM Transactions on Embedded Computing Systems, 2023, 22(4): No.66.
[15] ZHANG Z, LIU T, SHU Y, et al. Dynamic adaptive checkpoint mechanism for streaming applications based on reinforcement learning[C]// Proceedings of the 28th IEEE International Conference on Parallel and Distributed Systems. Piscataway: IEEE, 2023: 538-545.
[16] LIN C Y, WANG L C, CHANG S P. Incremental checkpointing for fault-tolerant stream processing systems: a data structure approach[J]. IEEE Transactions on Emerging Topics in Computing, 2022, 10(1): 124-136.
[17] BEHERA S, WAN L, MUELLER F, et al. P-ckpt: coordinated prioritized checkpointing[C]// Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium. Piscataway: IEEE, 2022: 436-446.
[18] RAHMAN S, KALATZIS A, WITTIE M P, et al. Dynamic checkpoint initiation in serverless MEC[C]// Proceedings of the 2022 IEEE International Conference on Omni-layer Intelligent Systems. Piscataway: IEEE, 2022: 1-6.
[19] GELDENHUYS M K, PFISTER B J J, SCHEINERT D, et al. Khaos: dynamically optimizing checkpointing for dependable distributed stream processing[C]// Proceedings of the 17th Conference on Computer Science and Intelligence Systems. Piscataway: IEEE, 2022: 553-561.
[20] YANG Y, YANG Z, XU C. Exploiting unblocking checkpoint for fault-tolerance in Pregel-like systems[C]// Proceedings of the 2021 International Conference on Web Information Systems Engineering, LNCS 13080. Cham: Springer, 2021: 71-86.
[21] YANG Y, XU C, KONG C, et al. Hybrid checkpointing for iterative processing in BSP-based systems[C]// Proceedings of the 2021 Web Information Systems and Applications, LNCS 12999. Cham: Springer, 2021: 693-705.
[22] MAURYA A, NICOLAE B, RAFIQUE M M, et al. Towards efficient I/O scheduling for collaborative multi-level checkpointing[C]// Proceedings of the 29th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems. Piscataway: IEEE, 2021: 1-8.
[23] ANTHONY Q, DAI D. Evaluating multi-level checkpointing for distributed deep neural network training[C]// Proceedings of the 2021 SC Workshops Supplementary Proceedings. Piscataway: IEEE, 2021: 60-67.
[24] PARASYRIS K, GEORGAKOUDIS G, BAUTISTA-GOMEZ L, et al. Co-designing multi-level checkpoint restart for MPI applications[C]// Proceedings of the 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing. Piscataway: IEEE, 2021: 103-112.
[25] FRANK A, BAUMGARTNER M, SALKHORDEH R, et al. Improving checkpointing intervals by considering individual job failure probabilities[C]// Proceedings of the 2021 IEEE International Parallel and Distributed Processing Symposium. Piscataway: IEEE, 2021: 299-309.
[26] ZHU Y, INTERLANDI M, ROY A, et al. Phoebe: a learning-based checkpoint optimizer[J]. Proceedings of the VLDB Endowment, 2021, 14(11): 2505-2518.
[27] BERTHOU G, MARQUET K, RISSET T, et al. MPU-based incremental checkpointing for transiently-powered systems[C]// Proceedings of the 23rd Euromicro Conference on Digital System Design. Piscataway: IEEE, 2020: 89-96.
[28] DEY T, SATO K, NICOLAE B, et al. Optimizing asynchronous multi-level checkpoint/restart configurations with machine learning[C]// Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium Workshops. Piscataway: IEEE, 2020: 1036-1043.
[29] JAYASEKARA S, HARWOOD A, KARUNASEKERA S. A utilization model for optimization of checkpoint intervals in distributed stream processing systems[J]. Future Generation Computer Systems, 2020, 110: 68-79.
[30] TIWARI D, GUPTA S, VAZHKUDAI S S. Lazy checkpointing: exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems[C]// Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. Piscataway: IEEE, 2014: 25-36.
[31] KUMAR R, JHA S, MAHGOUB A, et al. The mystery of the failing jobs: insights from operational data from two university-wide computing systems[C]// Proceedings of the 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. Piscataway: IEEE, 2020: 158-171.
[32] SINGLA P, SARANGI S R. A survey and experimental analysis of checkpointing techniques for energy harvesting devices[J]. Journal of Systems Architecture, 2022, 126: No.102464.
[33] AOUDA F A, MARQUET K, SALAGNAC G. Incremental checkpointing of program state to NVRAM for transiently-powered systems[C]// Proceedings of the 9th International Symposium on Reconfigurable and Communication-Centric Systems-on-Chip. Piscataway: IEEE, 2014: 1-4.
[34] LIU Y, ZHANG Y Y, ZHOU H Y. Efficient failure recovery method for stream data processing system[J]. Journal of Computer Applications, 2022, 42(11): 3337-3345. (in Chinese)
[35] MOODY A, BRONEVETSKY G, MOHROR K, et al. Design, modeling, and evaluation of a scalable multi-level checkpointing system[C]// Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. Piscataway: IEEE, 2010: 1-11.
[36] BAUTISTA-GOMEZ L, TSUBOI S, KOMATITSCH D, et al. FTI: high performance fault tolerance interface for hybrid systems[C]// Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. Seattle Washington: ACM, 2011: 1-32.
[37] NICOLAE B, MOODY A, GONSIOROWSKI E, et al. VeloC: towards high performance adaptive asynchronous checkpointing at large scale[C]// Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium. Piscataway: IEEE, 2019: 911-920.
[38] JOHN J, ARAYA I D N, GERNDT M. iCheck: leveraging RDMA and malleability for application-level checkpointing in HPC systems[C]// Proceedings of the 28th IEEE International Conference on Parallel and Distributed Systems. Piscataway: IEEE, 2023: 467-474.
[39] MAURYA A, RAFIQUE M M, TONELLOT T, et al. GPU-enabled asynchronous multi-level checkpoint caching and prefetching[C]// Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing. New York: ACM, 2023: 73-85.
[40] YOUNG J W. A first order approximation to the optimum checkpoint interval[J]. Communications of the ACM, 1974, 17(9): 530-531.
[41] DALY J T. A higher order estimate of the optimum checkpoint interval for restart dumps[J]. Future Generation Computer Systems, 2006, 22(3): 303-312.
[42] EL-SAYED N, SCHROEDER B. Checkpoint/restart in practice: When 'simple is better'[C]// Proceedings of the 2014 IEEE International Conference on Cluster Computing. Piscataway: IEEE, 2014: 84-92.
[43] SUBASI O, KESTOR G, KRISHNAMOORTHY S. Toward a general theory of optimal checkpoint placement[C]// Proceedings of the 2017 IEEE International Conference on Cluster Computing. Piscataway: IEEE, 2017: 464-474.
[44] LI L, ZNATI T. AtFP: attention-based failure predictor for extreme-scale computing[C]// Proceedings of the 13th International Conference on Reliability, Maintainability, and Safety. Piscataway: IEEE, 2022: 23-27.
[45] ALHARTHI K A, JHUMKA A, DI S, et al. Time machine: generative real-time model for failure (and lead time) prediction in HPC systems[C]// Proceedings of the 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks. Piscataway: IEEE, 2023: 508-521.
[46] DAS A, MUELLER F, ROUNTREE B. Aarohi: making real-time node failure prediction feasible[C]// Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium. Piscataway: IEEE, 2020: 1092-1101.
[47] DAS A, MUELLER F, SIEGEL C, et al. Desh: deep learning for system health prediction of lead times to failure in HPC[C]// Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. New York: ACM, 2018: 40-51.
[48] DAS A, MUELLER F, HARGROVE P, et al. Doomsday: predicting which node will fail when on supercomputers[C]// Proceedings of the 2018 International Conference for High Performance Computing, Networking, Storage and Analysis. Piscataway: IEEE, 2018: 108-121.

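Reference	Full/Incremental	Blocking/Non-blocking	Single-level/Multi-level	Periodic/Dynamic	Technical scheme	Optimization effect
[10]	Incremental	Blocking	Multi-level	Periodic	GPU-accelerated de-duplication to identify and eliminate data redundancy	Reduced space overhead
[11]	Full	Non-blocking	Single-level	Periodic	Multi-threaded I/O aggregation strategy	Reduced I/O contention
[12]	Full	Blocking	Multi-level	Periodic	Introduces a Backup Chain (BC) graph and a multi-level checkpoint storage mechanism	Reduced I/O contention
[13]	Incremental	Non-blocking	Multi-level	Dynamic	Incremental checkpointing based on memory update correlation and buddy pages	Reduced space overhead
[6]	Full	Non-blocking	Multi-level	Periodic	Introduces a multi-level checkpointing scheme	Reduced time overhead
[14]	Full	Blocking	Single-level	Dynamic	Identifies the critical path in a directed acyclic graph and optimizes checkpoint placement	Reduced time overhead
[15]	Full	Blocking	Single-level	Dynamic	Reinforcement-learning-based Dynamic Adaptive Checkpoint Mechanism (DACM)	Reduced time overhead
[16]	Incremental	Blocking	Single-level	Periodic	Data Structure based Incremental Checkpointing (DSIC)	Reduced space overhead
[17]	Full	Non-blocking	Multi-level	Dynamic	Prioritizes checkpoints of nodes that are about to fail	Reduced time overhead
[1]	Incremental	Blocking	Single-level	Dynamic	Incremental checkpointing based on the TripleC compiler	Reduced time overhead
[18]	Full	Blocking	Single-level	Dynamic	Sets checkpoints according to job checkpoint durations predicted dynamically by a Penalty-Based Multiple Linear Regression (PB-MLR) model	Reduced time overhead
[19]	Full	Blocking	Single-level	Dynamic	Uses cloud orchestration to automatically optimize the fault-tolerance configuration of distributed stream processing jobs	Reduced time overhead
[20]	Full	Non-blocking	Single-level	Periodic	Checkpointing strategy that combines staleness-aware and latency-aware skipping policies	Reduced time overhead
[21]	Full	Blocking/Non-blocking	Single-level	Periodic	Dynamically selects blocking or non-blocking checkpointing based on a time-cost evaluation	Reduced time overhead
[22]	Full	Non-blocking	Multi-level	Periodic	Exploits spatial imbalance among processes to reduce I/O overhead	Reduced I/O contention
[23]	Full	Non-blocking	Multi-level	Periodic	Introduces a multi-level checkpointing scheme into distributed deep learning training	Reduced I/O contention
[24]	Full	Non-blocking	Multi-level	Periodic	Passes failure topology information to accelerate checkpoint retrieval	Reduced time overhead
[25]	Full	Blocking	Single-level	Periodic	Optimizes checkpoint placement by considering individual job failure probabilities	Reduced time overhead
[26]	Full	Non-blocking	Multi-level	Periodic	Uses machine learning prediction and a scalable heuristic algorithm to determine the best checkpoint settings	Reduced time overhead
[27]	Incremental	Blocking	Single-level	Dynamic	Implements incremental checkpointing with the memory protection unit (MPU)	Reduced space overhead
[28]	Full	Non-blocking	Multi-level	Periodic	Combines simulation with machine learning to optimize multi-level checkpoint configurations	Reduced time overhead
[29]	Full	Non-blocking	Multi-level	Periodic	Optimizes checkpoint intervals based on a utilization model	Reduced time overhead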

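Level	Function	Fault-tolerance characteristic
Local (L₁)	Copies checkpoint data to node-local storage	Recoverable from any number of soft errors
Partner (L₂)	Copies checkpoint data to the storage of both a partner node and the local node	Recoverable as long as a node and its partner do not fail simultaneously
XOR (L₃)	Stores XOR-encoded checkpoint data on a designated set of XOR nodes	Recoverable as long as two or more nodes in the same XOR set do not fail simultaneously
RS code (L₃)	Stores RS-encoded checkpoint data on a designated set of RS nodes	Recoverable as long as no more than half of the nodes in the same RS set fail simultaneously
Parallel file system (L₄)	Copies checkpoint data to network storage	Recoverable from any failure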
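
The XOR level in the table above can be illustrated with a minimal sketch: the parity of all checkpoints in one XOR set is enough to rebuild the checkpoint of a single lost node, but not of two. Everything in the block is a hypothetical simplification. The function names are invented, the checkpoints are assumed to have equal length, and the parity is kept as one block, whereas production multi-level checkpoint libraries may distribute parity segments across the nodes of the set.

```python
from functools import reduce
from typing import Sequence


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(checkpoints: Sequence[bytes]) -> bytes:
    """Parity block for one XOR set: the XOR of every node's checkpoint."""
    return reduce(xor_bytes, checkpoints)


def recover_missing(survivors: Sequence[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical XOR set of three nodes with equally sized checkpoints.
xor_set = [b"node0-state", b"node1-state", b"node2-state"]
parity = make_parity(xor_set)

# Simulate the loss of node 1 and rebuild its checkpoint.
lost_index = 1
survivors = [c for i, c in enumerate(xor_set) if i != lost_index]
rebuilt = recover_missing(survivors, parity)
print(rebuilt == xor_set[lost_index])
```

If two checkpoints of the same set are lost at once, the parity alone no longer determines either of them, which matches the recovery condition stated for the XOR level.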


