Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (1): 44-49. DOI: 10.11772/j.issn.1001-9081.2017071948

• Papers of the 2017 National Annual Conference on Open Distributed and Parallel Computing (DPCS 2017) •

Distributed fault localization framework for high performance computing

GAO Jian, YU Kang, QING Peng, WEI Hongmei

  1. Jiangnan Institute of Computing Technology, Wuxi Jiangsu 214083, China
  • Received: 2017-08-08 Revised: 2017-08-24 Online: 2018-01-10 Published: 2018-01-22
  • Corresponding author: GAO Jian
  • About the authors: GAO Jian (1992-), male, born in Funing, Yunnan, M.S. candidate; his research interests include parallel computing and runtime systems. YU Kang (1987-), male, born in Jingdezhen, Jiangxi, assistant engineer, Ph.D.; his research interests include parallel computing. QING Peng (1979-), male, born in Ziyang, Sichuan, senior engineer, M.S.; his research interests include parallel compilation and runtime systems. WEI Hongmei (1968-), female, born in Wuxi, Jiangsu, senior engineer, Ph.D.; her research interests include parallel computing and parallel compilation.
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2016YFB0200502).

Abstract: Fault localization in high performance computing systems is difficult and suffers from poor real-time performance. To address this, a Message-Passing based Fault Localization (MPFL) framework is proposed, comprising a Tree-based Fault Detection (TFD) algorithm and a Tree-based Fault Analysis (TFA) algorithm. First, when a parallel job is initialized, all nodes participating in the computation are logically partitioned into a tree, the Fault Localization Tree (FLT), and the fault localization tasks are distributed across those nodes. Second, when a component such as the message-passing library or the operating system detects an abnormal node state, the TFD algorithm analyzes the job's FLT structure and selects the node that will receive the abnormal state, according to factors such as load balancing and performance overhead. Finally, the receiving node uses the TFA algorithm to reason over the abnormal states it has received and derive the fault. TFA combines rule-based event correlation with lightweight active probing built on message passing, and the combination improves the accuracy of fault analysis. MPFL was evaluated on a cluster with the NPB-FT and NPB-IS benchmarks, using simulated node shutdown faults as the localization target. The experimental results show that MPFL performs well in both fault localization capability and overhead reduction.

Key words: high performance computing, message-passing, fault localization, event correlation, active probing
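
The three steps in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not give the tree fan-out, the receiver-selection policy, or the correlation rules, so the k-ary layout, the least-loaded-neighbor policy, the two-reporter threshold, and the names `build_flt`, `choose_receiver`, and `analyze` are all assumptions made for illustration. The `probe` callable stands in for an active probe sent over the message-passing layer.

```python
from collections import defaultdict

FANOUT = 2  # assumed fan-out; the abstract does not specify one


def build_flt(ranks, fanout=FANOUT):
    """Arrange ranks into a logical k-ary tree: {rank: parent rank or None}."""
    parent = {}
    for i, rank in enumerate(ranks):
        parent[rank] = None if i == 0 else ranks[(i - 1) // fanout]
    return parent


def choose_receiver(parent, suspect, load):
    """Pick the node that receives a report about `suspect`.

    Hypothetical policy: take the least-loaded node among the suspect's
    parent and siblings; if the suspect is the root, use its children.
    Returns None when no candidate exists.
    """
    p = parent[suspect]
    if p is None:
        candidates = [n for n, q in parent.items() if q == suspect]
    else:
        candidates = [p] + [n for n, q in parent.items()
                            if q == p and n != suspect]
    if not candidates:
        return None
    return min(candidates, key=lambda n: (load[n], n))


def analyze(events, probe, threshold=2):
    """Rule-based correlation followed by active probing.

    events: iterable of (reporter, suspect) pairs.
    probe:  callable suspect -> bool, True if the suspect still answers.
    """
    reporters = defaultdict(set)
    for reporter, suspect in events:
        reporters[suspect].add(reporter)
    faulty = []
    for suspect, who in sorted(reporters.items()):
        # Assumed rule: enough distinct reporters agree, then confirm by probe.
        if len(who) >= threshold and not probe(suspect):
            faulty.append(suspect)
    return faulty


# Usage: seven nodes, node 3 has shut down, node 1 is busy.
ranks = list(range(7))
flt = build_flt(ranks)
load = {r: 0 for r in ranks}
load[1] = 5
receiver = choose_receiver(flt, 3, load)
down = {3}
events = [(1, 3), (4, 3), (2, 5)]
faulty = analyze(events, probe=lambda n: n not in down)
print(receiver, faulty)
```

In this sketch a single report about node 5 is not enough to trigger a probe, while the two independent reports about node 3 are, which mirrors the idea of using correlation to decide where probing is worth its cost.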

CLC number: