Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (2): 382-386.

• Advanced computing •

Fault-tolerance period optimization method for application development frameworks oriented to computational fluid dynamics

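ZHANG Yongjun, XU Xinhai

  1. School of Computer, National University of Defense Technology, Changsha 410073
  • Received: 2013-08-15 Revised: 2013-10-16 Publication date: 2014-02-01 Release date: 2014-03-01
  • Corresponding author: ZHANG Yongjun
  • About the authors: ZHANG Yongjun (1972-), male, born in Xinshao, Hunan, associate research fellow, Ph.D.; main research interests: high-performance computing, fault tolerance. XU Xinhai (1984-), male, born in Zhenjiang, Jiangsu, assistant research fellow, Ph.D., CCF member; main research interests: high-performance computing, fault tolerance.
  • Funding:
    Project supported by the National Natural Science Foundation of China; project supported by the Guangzhou Municipal Science and Information Technology Bureau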

Fault-tolerance period optimization method for computational fluid dynamics-oriented application development frameworks

ZHANG Yongjun, XU Xinhai

  1. School of Computer, National University of Defense Technology, Changsha, Hunan 410073, China
  • Received: 2013-08-15 Revised: 2013-10-16 Online: 2014-02-01 Published: 2014-03-01
  • Contact: ZHANG Yongjun
  • Supported by:
    National Natural Science Foundation

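Abstract: To address the insufficient fault-tolerance support in application development frameworks for computational fluid dynamics, a new fault-tolerance period optimization method is proposed. Based on a probabilistic model of system failures, the method computes the ideal optimal fault-tolerance period; it then determines the actual checkpoint backup times online, taking into account how computational fluid dynamics applications output their field data. Experimental results on three typical applications show that, on systems with different mean times between failures, the method always achieves the lowest fault-tolerance overhead compared with performing fault tolerance at a fixed number of time steps. Based on this method, users can conveniently set the fault-tolerance period through the framework interfaces and effectively reduce the overhead caused by fault tolerance.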

Keywords: fault tolerance, period optimization, checkpoint, computational fluid dynamics, development framework

Abstract: To address the limited fault-tolerance support in Computational Fluid Dynamics (CFD)-oriented application development frameworks, a new fault-tolerance period optimization method is proposed. The method computes the ideal optimal fault-tolerance period from a probabilistic model of system faults, and then determines the actual checkpoint times online by taking the field-output characteristics of CFD applications into account. Experimental results on three applications show that, on systems with different mean times between failures, the proposed method always achieves the lowest fault-tolerance overhead compared with checkpointing at a fixed number of time steps. With this method, users can conveniently set the fault-tolerance period through the framework interfaces and reduce the fault-tolerance overhead.

Key words: fault tolerance, period optimization, checkpoint, Computational Fluid Dynamics (CFD), development framework
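
The two-stage idea summarized in the abstract (an ideal period from a failure model, then actual checkpoints tied to field output) can be illustrated with a minimal sketch. The abstract does not state the paper's probability model or its online rule, so everything below is an assumption: the ideal period uses Young's classic first-order approximation, T = sqrt(2 · C · MTBF), as a stand-in, and the online rule simply checkpoints at the first field-output step at which the time since the last checkpoint has reached that period. The function and parameter names are hypothetical and are not part of any framework interface.

```python
import math


def ideal_checkpoint_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Ideal period in seconds via Young's approximation (an assumed stand-in,
    not necessarily the model used in the paper)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_steps(total_steps: int, step_time_s: float,
                     output_interval_steps: int, ideal_period_s: float) -> list[int]:
    """Choose checkpoint steps that coincide with field-output steps.

    Assumed rule: walk over the field-output steps and checkpoint at the
    first one where the simulated time since the last checkpoint is at
    least the ideal period.
    """
    steps = []
    last_checkpoint = 0
    for step in range(output_interval_steps, total_steps + 1, output_interval_steps):
        elapsed_s = (step - last_checkpoint) * step_time_s
        if elapsed_s >= ideal_period_s:
            steps.append(step)
            last_checkpoint = step
    return steps


# Hypothetical numbers: 50 s per checkpoint, MTBF of 10 000 s,
# 10 s per time step, field data written every 30 steps.
period = ideal_checkpoint_period(checkpoint_cost_s=50.0, mtbf_s=10_000.0)
schedule = checkpoint_steps(total_steps=1000, step_time_s=10.0,
                            output_interval_steps=30, ideal_period_s=period)
print(period, schedule)
```

With these hypothetical inputs the ideal period is 1000 s, while field output happens every 300 s, so the sketch checkpoints every fourth output (every 1200 s). Aligning to output steps trades a slightly longer period for reuse of data the application is already writing; whether the paper makes the same trade-off is not stated in the abstract.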

CLC number: