Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (5): 1393-1398. DOI: 10.11772/j.issn.1001-9081.2017103024

• Advanced computing •

Implementation of deterministic routing fault-tolerant strategies in high-dimensional fat-tree systems

XU Jiaqing1, WAN Wen2, CAI Dongjing1, TANG Fuqiao1, HE Jie1, ZHANG Lei1

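  1. School of Computer, National University of Defense Technology, Changsha 410073;
    2. National Supercomputer Center in Guangzhou, Sun Yat-sen University, Guangzhou 510006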
  • Received: 2017-12-25 Revised: 2017-12-25 Online: 2018-05-10 Published: 2018-05-24
  • Corresponding author: XU Jiaqing
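  • About the authors: XU Jiaqing (1982-), male, born in Rudong, Jiangsu, assistant research fellow, Ph.D., CCF member; research interests: high-speed interconnect, high performance computing. WAN Wen (1987-), male, born in Yongzhou, Hunan, engineer, M.S.; research interests: high performance computing, computer architecture. CAI Dongjing (1986-), male, born in Xishui, Hubei, engineer, M.S.; research interests: high-speed interconnect, high performance computing. TANG Fuqiao (1989-), male, born in Yongzhou, Hunan, engineer, M.S.; research interests: high-speed interconnect, high performance computing. HE Jie (1988-), male, born in Changde, Hunan, engineer, M.S.; research interests: high-speed interconnect, high performance computing. ZHANG Lei (1984-), male, born in Liuyang, Hunan, engineer; research interests: high-speed interconnect, high performance computing.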
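  • Funding:
    National Key Research and Development Program of China (2016YFB0200203); General Program of the National Natural Science Foundation of China (61572509).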

Implementation of deterministic routing fault-tolerant strategies for K-Ary N-Bridge system

XU Jiaqing1, WAN Wen2, CAI Dongjing1, TANG Fuqiao1, HE Jie1, ZHANG Lei1   

  1. School of Computer, National University of Defense Technology, Changsha Hunan 410073, China;
    2. National Supercomputer Center in Guangzhou, Sun Yat-sen University, Guangzhou Guangdong 510006, China
  • Received: 2017-12-25 Revised: 2017-12-25 Online: 2018-05-10 Published: 2018-05-24
  • Contact: XU Jiaqing
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2016YFB0200203) and the National Natural Science Foundation of China (61572509).

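Abstract: In high performance computer systems that adopt the high-dimensional fat-tree topology, a leaf switch failure severely affects the use of the system. To improve system availability and maintainability, a set of deterministic routing fault-tolerant strategies suited to the high-dimensional fat-tree topology was proposed, based on the idea of misrouting. The basic idea is to bypass the failed leaf switch through misrouting, jump to another leaf switch in the same dimension, and then reach the destination node through normal routing. The strategy can shield a failed leaf switch without affecting the use of the system, and fault-tolerance experiments were carried out on a real high-dimensional fat-tree system. The experimental results show that the strategy achieves the expected effect of quickly shielding failed leaf switches and can effectively improve the efficiency of system maintenance.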

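Key words: high-dimensional fat-tree topology, interconnection fault, routing fault-tolerance strategy, high performance computing, network maintenance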

Abstract: A leaf switch failure seriously affects the use of a high performance computer system with the K-Ary N-Bridge topology. To improve the availability and maintainability of such systems, a deterministic routing fault-tolerant strategy based on misrouting was proposed. The basic idea is to bypass the failed leaf switch by misrouting, jump to another leaf switch in the same dimension, and then reach the destination node through the normal route. The proposed strategy can shield the failed leaf switch without affecting system usage. A fault-tolerance experiment was carried out on a real K-Ary N-Bridge system. The results show that the strategy quickly shields the failed leaf switch as expected and can effectively improve the efficiency of system maintenance.

Key words: K-Ary N-Bridge, interconnection fault, routing fault-tolerance strategy, high performance computing, network maintenance
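
The basic idea stated in the abstract (bypass a failed leaf switch, jump to another leaf switch in the same dimension, then continue by normal routing) can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the modulo rule used as the "normal" deterministic choice, the cyclic scan used as the misrouting rule, and the names `pick_leaf_switch`, `route`, `L0`..`L3`, and `nodeN` are all hypothetical, and the real K-Ary N-Bridge topology is not modeled.

```python
from typing import List, Set


def pick_leaf_switch(dest: int, leaves: List[str], failed: Set[str]) -> str:
    """Deterministically choose a leaf switch in one dimension for `dest`.

    Hypothetical rule (not from the paper): the normal choice is
    leaves[dest % len(leaves)]; if that switch has failed, scan cyclically
    for the next healthy switch in the same dimension.
    """
    if not leaves:
        raise ValueError("the dimension has no leaf switches")
    n = len(leaves)
    start = dest % n
    for offset in range(n):
        candidate = leaves[(start + offset) % n]
        if candidate not in failed:
            return candidate
    raise RuntimeError("no healthy leaf switch left in this dimension")


def route(src: int, dest: int, leaves: List[str], failed: Set[str]) -> List[str]:
    """Return the hop list: source node, chosen leaf switch, destination node."""
    leaf = pick_leaf_switch(dest, leaves, failed)
    return [f"node{src}", leaf, f"node{dest}"]


if __name__ == "__main__":
    leaves = ["L0", "L1", "L2", "L3"]
    print(route(1, 6, leaves, set()))   # normal route via L2
    print(route(1, 6, leaves, {"L2"}))  # L2 failed: detour via L3
```

Because the detour is a pure function of the destination and the set of failed switches, the same inputs always yield the same path, which is the sense in which this toy routing stays deterministic.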

CLC Number: