Quasi-optimal period computation model for hierarchical checkpoint protocol

doi:10.11772/j.issn.1001-9081.2017.01.0103

Journal of Computer Applications, 2017, Vol. 37, Issue (1): 103-107. DOI: 10.11772/j.issn.1001-9081.2017.01.0103

• Papers from the 2016 National Annual Conference on Open Distributed and Parallel Computing (DPCS2016) •


LYU Hongwu, GU Lei, WANG Huiqiang, ZOU Shichen, FENG Guangsheng

College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China

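Received: 2016-07-20 Revised: 2016-08-05 Online: 2017-01-09 Published: 2017-01-10
Corresponding author: GU Lei
About the authors: LYU Hongwu (1983-), male, born in Rizhao, Shandong, lecturer, Ph.D., CCF member; main research interests: availability, performance evaluation, cloud computing. GU Lei (1991-), male, born in Anyang, Henan, master's student; main research interests: high-availability systems, network security. WANG Huiqiang (1960-), male, born in Harbin, Heilongjiang, professor, Ph.D., CCF member; main research interests: network security, future networks. ZOU Shichen (1988-), male, born in Harbin, Heilongjiang, Ph.D. candidate, CCF member; main research interests: dependability assurance, trust management. FENG Guangsheng (1980-), male, born in Yucheng, Shandong, lecturer, Ph.D., CCF member; main research interests: network security, cognitive networks.
Funding:
National Natural Science Foundation of China (61370212, 61402127, 61502118); Natural Science Foundation of Heilongjiang Province (F2015029).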

Quasi-optimal period computation model for hierarchical checkpoint protocol

LYU Hongwu, GU Lei, WANG Huiqiang, ZOU Shichen, FENG Guangsheng

College of Computer Science and Technology, Harbin Engineering University, Harbin, Heilongjiang 150001, China

Received: 2016-07-20 Revised: 2016-08-05 Online: 2017-01-09 Published: 2017-01-10
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61370212, 61402127, 61502118) and the Natural Science Foundation of Heilongjiang Province (F2015029).

Abstract

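Abstract (translated from the Chinese): To improve checkpoint efficiency in large-scale High Performance Computing (HPC) systems, a quasi-optimal period computation model for hierarchical checkpointing is proposed. First, by analyzing the execution of an application in an HPC system, checkpoint period optimization is abstracted as a nonlinear checkpoint cost model. Second, the hierarchical checkpoint cost formula is derived by analyzing the possible fault locations, and two slowdown factors and one speedup factor are introduced to model the impact of message logging on hierarchical checkpointing. Simulation results show that the average error between the checkpoint cost of the proposed model and that of the theoretical quasi-optimal period is below 5%, and that this average error is 20% lower than that of traditional checkpoint period optimization models. The model can effectively improve checkpoint efficiency and the availability of HPC systems.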

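Keywords (translated from the Chinese): high performance computing, fault tolerance, hierarchical checkpoint, checkpoint period, quasi-optimal solution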

Abstract: As High Performance Computing (HPC) systems grow in scale, improving checkpoint efficiency becomes increasingly important. A model for computing the quasi-optimal period of the hierarchical checkpoint protocol is proposed. First, the execution of an application in an HPC system is analyzed, and the checkpoint period optimization problem is abstracted as a nonlinear checkpoint cost model. Second, the hierarchical checkpoint cost formula is derived by analyzing the possible fault locations; two deceleration parameters and one acceleration parameter are introduced to reflect the impact of message logging on the hierarchical checkpoint. Simulation results show that the average error of the proposed model, relative to the checkpoint cost of the theoretical quasi-optimal period, is below 5%, which is 20% lower than that of the traditional Markov-chain-based model. The proposed model can effectively improve the efficiency of the hierarchical checkpoint protocol and enhance the availability of the HPC system.

Key words: High Performance Computing (HPC), fault tolerance, hierarchical checkpoint, checkpoint period, quasi-optimal solution
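
As a rough illustration of what checkpoint period optimization means, the following Python sketch minimizes a generic first-order cost model for periodic checkpointing and compares the numerical minimum with the well-known closed form sqrt(2 * C * MTBF). The cost function, the function names, and the sample values (a 60 s checkpoint and a 24 h mean time between failures) are assumptions chosen for illustration. They are not the model, parameters, or results of this paper, and the deceleration and acceleration parameters for message logging are not represented.

```python
import math


def waste_fraction(period, checkpoint_cost, mtbf, restart_cost=0.0):
    """Assumed first-order fraction of time lost with periodic checkpointing.

    checkpoint_cost / period           : time spent writing checkpoints
    (restart_cost + period / 2) / mtbf : expected restart and rework per failure
    All arguments are in seconds. This is a textbook-style simplification,
    not the hierarchical cost formula of the paper.
    """
    return checkpoint_cost / period + (restart_cost + period / 2.0) / mtbf


def first_order_period(checkpoint_cost, mtbf):
    """Closed-form minimizer of waste_fraction: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def golden_section_min(f, lo, hi, tol=1e-6):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2.0


def numeric_period(checkpoint_cost, mtbf, restart_cost=0.0):
    """Numerically search for the period that minimizes waste_fraction."""
    return golden_section_min(
        lambda t: waste_fraction(t, checkpoint_cost, mtbf, restart_cost),
        1.0,
        mtbf,
    )


if __name__ == "__main__":
    C = 60.0       # hypothetical checkpoint cost in seconds
    M = 86400.0    # hypothetical mean time between failures in seconds
    print("closed form period:", first_order_period(C, M))
    print("numeric period:    ", numeric_period(C, M))
```

The numerical search is included because a nonlinear cost model generally has no closed-form minimizer; in this simplified case the two values should agree closely, which serves as a sanity check on the search.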

CLC number: TP399, TP302


LYU Hongwu, GU Lei, WANG Huiqiang, ZOU Shichen, FENG Guangsheng. Quasi-optimal period computation model for hierarchical checkpoint protocol[J]. Journal of Computer Applications, 2017, 37(1): 103-107.

