[1] DONGARRA J, BECKMAN P, MOORE T, et al. The international exascale software project roadmap[J]. International Journal of High Performance Computing Applications, 2011, 25(1):3-60.
[2] SCHROEDER B, GIBSON G A. A large-scale study of failures in high-performance computing systems[J]. IEEE Transactions on Dependable and Secure Computing, 2010, 7(4):337-350.
[3] YOUNG J W. A first order approximation to the optimum checkpoint interval[J]. Communications of the ACM, 1974, 17(9):530-531.
[4] DALY J T. A higher order estimate of the optimum checkpoint interval for restart dumps[J]. Future Generation Computer Systems, 2006, 22(3):303-312.
[5] YAN X A, YANG J M, TIAN H. Analysis of best checkpoint interval of duplicated fault tolerance system[J]. Computer Engineering, 2007, 33(5):283-285. (in Chinese)
[6] GE Y, YANG Y, ZHU C. Study of the best checkpoint interval in the distributed simulation system based on virtualization technology[C]//AMCCE 2015:Proceedings of 2015 International Conference on Automation, Mechanical Control and Computational Engineering. Amsterdam:Atlantis Press, 2015:193-197.
[7] HUANG Q, SHANG L H, ZHOU M, et al. A group-based coordinated checkpointing algorithm for large-scale parallel system[J]. Journal of Computer Research and Development, 2010, 47(S1):158-163. (in Chinese)
[8] JIN H, CHEN Y, ZHU H, et al. Optimizing HPC fault-tolerant environment:An analytical approach[C]//ICPP 2010:Proceedings of the 2010 39th International Conference on Parallel Processing. Piscataway, NJ:IEEE, 2010:525-534.
[9] ZHENG Z, LAN Z. Reliability-aware scalability models for high performance computing[C]//CLUSTER'09:Proceedings of 2009 IEEE International Conference on Cluster Computing and Workshops. Piscataway, NJ:IEEE, 2009:1-9.
[10] WANG L, PATTABIRAMAN K, KALBARCZYK Z, et al. Modeling coordinated checkpointing for large-scale supercomputers[C]//DSN 2005:Proceedings of the 2005 International Conference on Dependable Systems and Networks. Piscataway, NJ:IEEE, 2005:812-821.
[11] GUERMOUCHE A, ROPARS T, SNIR M, et al. HydEE:failure containment without event logging for large scale send-deterministic MPI applications[C]//IPDPS 2012:Proceedings of 2012 IEEE 26th International Conference on Parallel & Distributed Processing Symposium. Piscataway, NJ:IEEE, 2012:1216-1227.
[12] BOUTEILLER A, HERAULT T, BOSILCA G, et al. Correlated set coordination in fault tolerant message logging protocols[C]//Euro-Par'11:Proceedings of the 17th International Conference on Parallel Processing. Berlin:Springer, 2011:51-64.
[13] BOUTEILLER A, BOSILCA G, DONGARRA J. Redesigning the message logging model for high performance[J]. Concurrency and Computation:Practice and Experience, 2010, 22(16):2196-2211.