A Distributed Fault Localization Framework for High Performance Computing

Jian GAO, Kang YU

Jiangnan Institute of Computing Technology

Received: 2017-08-08 Revised: 2017-08-24 Online: 2017-08-24
Corresponding author: Jian GAO

Abstract

Fault localization in high performance computing systems is difficult and responds poorly in real time. To address this, a message-passing based fault localization framework (MPFL) is proposed, comprising tree-based fault detection (TFD) and tree-based fault analysis (TFA) algorithms. First, when a parallel job is initialized, all participating compute nodes are logically partitioned into a tree, producing the fault localization tree (FLT), and the fault localization tasks are distributed across the nodes. Then, when a component such as the message-passing library or the operating system detects an abnormal node state, the TFD algorithm analyzes the job's FLT and selects the node that will receive the abnormal state, according to factors such as load balance and performance overhead. Finally, the receiving node applies the TFA algorithm to infer the fault from the abnormal states it has received. TFA combines rule-based event correlation with lightweight active probing built on message passing, which improves the accuracy of fault analysis. MPFL was evaluated on a cluster, with simulated node outages as the localization target and NPB-FT and NPB-IS as benchmarks. The results show that MPFL locates the simulated node outages and that the MPFL service does not affect application performance in practice.

Keywords: high performance computing, message passing, fault localization, event correlation, active probing
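
The three steps described in the abstract (building the FLT, selecting a receiver with TFD, and inferring the fault with TFA) can be illustrated with a minimal Python sketch. The abstract does not give the actual algorithms, so everything here is an assumption made for illustration: the fanout-ary tree layout by rank, the "nearest live ancestor" receiver policy (a stand-in for the load-balance and overhead criteria), the `"unreachable"` symptom name, the `min_reports` threshold, and the `probe` callable that simulates an active probe. The function names `build_flt`, `select_receiver`, and `analyze` are hypothetical.

```python
from collections import defaultdict


def build_flt(num_nodes, fanout=2):
    """Build a fault localization tree as {rank: parent_rank}.

    Assumed layout: a fanout-ary tree rooted at rank 0.
    """
    return {r: (None if r == 0 else (r - 1) // fanout) for r in range(num_nodes)}


def select_receiver(flt, suspect, reporter, down=frozenset()):
    """TFD-style receiver choice (assumed policy, not the paper's).

    Walk up from the suspect and return the nearest ancestor that is not
    known to be down. If no such ancestor exists, the reporter keeps the event.
    """
    node = flt[suspect]
    while node is not None and node in down:
        node = flt[node]
    return reporter if node is None else node


def analyze(events, probe, min_reports=2):
    """TFA-style analysis (assumed rule): correlate events, then probe.

    events: iterable of (reporter, suspect, symptom) tuples.
    probe:  callable(rank) -> True if the node answers the active probe.
    A suspect is declared failed only if enough distinct reporters flagged it
    as unreachable AND the active probe to it also fails.
    """
    reporters = defaultdict(set)
    for reporter, suspect, symptom in events:
        if symptom == "unreachable":
            reporters[suspect].add(reporter)

    faults = set()
    for suspect, who in reporters.items():
        if len(who) >= min_reports and not probe(suspect):
            faults.add(suspect)
    return faults


# Usage: 7 nodes, node 4 is simulated as down.
flt = build_flt(7, fanout=2)
dead = {4}
probe = lambda rank: rank not in dead  # simulated active probe
events = [
    (3, 4, "unreachable"),
    (1, 4, "unreachable"),
    (5, 6, "unreachable"),  # a single report, not enough to correlate
]
receiver = select_receiver(flt, suspect=4, reporter=3)
faults = analyze(events, probe)
```

In this sketch, event correlation narrows the candidate set cheaply and the probe is sent only to nodes that several peers already suspect, which is one plausible way to read the abstract's claim that combining the two methods improves accuracy.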

CLC Number: TP302.8

Jian GAO, Kang YU. DPCS2017+26+A Distributed Fault Localization Framework for High Performance Computing[J]. Journal of Computer Applications.
