
DPCS2017+26+A Distributed Fault Localization Framework for High Performance Computing

Jian GAO, Kang YU

  1. Jiangnan Institute of Computing Technology
  • Received: 2017-08-08 Revised: 2017-08-24 Online: 2017-08-24
  • Contact: Jian GAO

Abstract: Fault localization in high performance computing systems is difficult and responds poorly in real time. To address this, a message-passing based fault localization framework (MPFL) is proposed, comprising tree-based fault detection (TFD) and tree-based fault analysis (TFA) algorithms. First, when a parallel job is initialized, all participating compute nodes are logically partitioned into a tree, the fault localization tree (FLT), and the fault localization tasks are distributed across the nodes. Then, when a component such as the message-passing library or the operating system detects an abnormal node state, the TFD algorithm analyzes the job's FLT and selects the node that will receive the abnormal state according to factors such as load balancing and performance overhead. Finally, the receiving node uses the TFA algorithm to infer the fault from the abnormal states it has received. TFA combines rule-based event correlation with lightweight active probing built on message passing, and the combination improves the accuracy of fault analysis. MPFL is evaluated on a typical cluster, with simulated node outages as the localization target and NPB-FT and NPB-IS as benchmarks. The results demonstrate MPFL's fault localization capability and show that the MPFL service does not affect application performance in practice.
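
The three stages described in the abstract (building the FLT, selecting a receiving node, and analyzing reports with correlation plus a probe) can be illustrated with a minimal Python sketch. The abstract does not specify the tree arity, the receiver selection policy, or the correlation rules, so everything below is an assumption made for illustration: the function names `build_flt`, `select_receiver`, and `analyze`, the k-ary tree layout rooted at rank 0, the "least-loaded parent or sibling" policy, and the "at least two distinct reporters, then probe" rule are all hypothetical and are not taken from the MPFL implementation.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def build_flt(num_nodes: int, fanout: int = 2) -> Dict[int, List[int]]:
    """Map each node rank to its children in a logical k-ary tree rooted at rank 0.

    Assumption: the real FLT layout is not given in the abstract; a heap-style
    k-ary tree is used here only because it is easy to compute locally.
    """
    return {
        rank: [c for c in range(rank * fanout + 1, rank * fanout + fanout + 1)
               if c < num_nodes]
        for rank in range(num_nodes)
    }


def parent_of(rank: int, fanout: int = 2) -> Optional[int]:
    """Return the parent rank in the k-ary tree, or None for the root."""
    return None if rank == 0 else (rank - 1) // fanout


def select_receiver(suspect: int, flt: Dict[int, List[int]],
                    load: Dict[int, int], fanout: int = 2) -> Optional[int]:
    """Choose the node that receives abnormal-state reports about `suspect`.

    Hypothetical policy: candidates are the suspect's parent and siblings
    (or its children when the suspect is the root); the least-loaded
    candidate wins, with ties broken by the lowest rank.
    """
    parent = parent_of(suspect, fanout)
    if parent is None:
        candidates = list(flt[suspect])
    else:
        candidates = [parent] + [s for s in flt[parent] if s != suspect]
    if not candidates:
        return None  # single-node job: nobody else can receive the report
    return min(candidates, key=lambda r: (load.get(r, 0), r))


def analyze(suspect: int, events: Iterable[Tuple[int, int]],
            probe: Callable[[int], bool], threshold: int = 2) -> str:
    """Combine a simple correlation rule with an active probe.

    `events` holds (reporter, target) pairs. Assumed rule: once at least
    `threshold` distinct reporters flag the suspect, probe it; an unanswered
    probe is classified as a node outage.
    """
    reporters = {src for src, target in events if target == suspect}
    if len(reporters) < threshold:
        return "insufficient evidence"
    return "transient anomaly" if probe(suspect) else "node outage"
```

In a real deployment the `probe` callable would wrap a message-passing round trip with a timeout; here it is any function that returns `True` when the suspect node answers. Calling the probe only after the correlation rule fires is one way to keep active probing lightweight, though the paper's actual trigger conditions may differ.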

Keywords: high performance computing, message-passing, fault localization, event correlation, active probing

CLC Number: