Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (4): 1113-1121. DOI: 10.11772/j.issn.1001-9081.2020071067

Special Issue: Data science and technology

• Data science and technology •

Reliability analysis models for replication-based storage systems with proactive fault tolerance

LI Jing1, LUO Jinfei2, LI Bingchao1   

  1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China;
    2. College of Computer Science, Nankai University, Tianjin 300350, China
  • Received: 2020-07-21 Revised: 2020-09-15 Online: 2021-04-10 Published: 2020-10-20
  • Supported by:
    This work is partially supported by the Youth Program of the National Natural Science Foundation of China (61702521) and the Fundamental Research Funds for the Central Universities (3122019122).


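LI Jing (李静)1, LUO Jinfei (罗金飞)2, LI Bingchao (李炳超)1

  1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300;
    2. College of Computer Science, Nankai University, Tianjin 300350
  • Corresponding author: LI Jing (李静)
  • About the authors: LI Jing (李静), born in 1982, female, a native of Dezhou, Shandong, lecturer, Ph.D.; her research interests include large-scale data storage and machine learning. LUO Jinfei (罗金飞), born in 1995, male, a native of Zhoukou, Henan, M.S. candidate; his research interests include distributed storage systems and erasure-coded storage systems. LI Bingchao (李炳超), born in 1987, male, a native of Cangzhou, Hebei, lecturer, Ph.D.; his research interests include computer architecture.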

Abstract: A proactive fault tolerance mechanism predicts disk failures and prompts the system to migrate and back up the at-risk data in advance, and can therefore enhance storage system reliability. Because existing research cannot accurately evaluate the reliability of replication-based storage systems with proactive fault tolerance, several state transition models were proposed for such systems. The models were then implemented with Monte Carlo simulation to simulate the operation of replication-based storage systems with proactive fault tolerance, and the expected number of data-loss events in the systems over a given period was counted. The Weibull distribution was used to model the time distributions of device failure and failure repair events, and the impacts of the proactive fault tolerance mechanism, node failures, node failure repairs, disk failures and disk failure repairs on system reliability were evaluated quantitatively. Experimental results showed that when the accuracy of the prediction model reached 50%, system reliability could be improved by 1 to 3 times, and that 3-way replication systems were more sensitive to system parameters than 2-way replication systems. With the proposed models, system administrators can assess system reliability under different fault tolerance schemes and system parameters, and then build storage systems with high reliability and high availability.

Key words: proactive fault tolerance, replication-based storage system, reliability analysis, node failure, disk failure, Weibull distribution, system state transition
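
The simulation approach summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' state transition models: it assumes a single replica group of independent disks, ignores node failures, treats a correctly predicted failure as a disk swap that never reduces redundancy, and restarts the group with fresh disks after each data loss. All names (`simulate_group`, `expected_losses`) and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def simulate_group(rng, n_replicas, mission_hours, fail_scale, fail_shape,
                   repair_scale, repair_shape, predict_accuracy):
    """Count data-loss events for one replica group over one mission.

    Assumption (not from the paper): a loss happens when every replica
    is under repair at the same time.
    """
    def new_failure(now):
        # Weibull-distributed time to failure for a fresh disk.
        return now + rng.weibullvariate(fail_scale, fail_shape)

    fail_at = [new_failure(0.0) for _ in range(n_replicas)]
    repair_done = [None] * n_replicas  # None means the replica is healthy
    losses = 0
    while True:
        # Earliest pending event: a failure for healthy disks,
        # a repair completion for disks under repair.
        t, i = min(
            (fail_at[k] if repair_done[k] is None else repair_done[k], k)
            for k in range(n_replicas)
        )
        if t > mission_hours:
            return losses
        if repair_done[i] is not None:
            # Repair finished: the replica is whole again on a fresh disk.
            repair_done[i] = None
            fail_at[i] = new_failure(t)
        elif rng.random() < predict_accuracy:
            # Predicted failure: data was migrated in advance (assumed instant).
            fail_at[i] = new_failure(t)
        else:
            # Unpredicted failure: reactive repair with Weibull duration.
            repair_done[i] = t + rng.weibullvariate(repair_scale, repair_shape)
            if all(r is not None for r in repair_done):
                losses += 1
                # Assumed recovery: restart the whole group with fresh disks.
                repair_done = [None] * n_replicas
                fail_at = [new_failure(t) for _ in range(n_replicas)]


def expected_losses(n_runs, seed=0, **params):
    """Monte Carlo estimate of the expected number of data-loss events."""
    rng = random.Random(seed)
    return sum(simulate_group(rng, **params) for _ in range(n_runs)) / n_runs


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    params = dict(n_replicas=2, mission_hours=10_000.0,
                  fail_scale=1_000.0, fail_shape=1.2,
                  repair_scale=100.0, repair_shape=2.0)
    print("reactive only:", expected_losses(300, predict_accuracy=0.0, **params))
    print("50% predicted:", expected_losses(300, predict_accuracy=0.5, **params))
```

Sweeping `predict_accuracy` or `n_replicas` in such a sketch gives a rough feel for the kind of sensitivity analysis the abstract reports, although the numbers it produces have no bearing on the paper's results.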

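Abstract (Chinese version, translated): A proactive fault tolerance mechanism detects disks that are about to fail and prompts the system to migrate and back up the at-risk data in advance, thereby significantly improving storage system reliability. To address the problem that existing research cannot accurately evaluate the reliability of replication-based storage systems with proactive fault tolerance, several state transition models for replication-based storage systems are proposed. The models are implemented with a Monte Carlo simulation algorithm to simulate the operation of replication-based storage systems with proactive fault tolerance, and the expected number of data-loss events during a given operating period is counted. The Weibull distribution function is used to model the time distributions of device failure and failure repair events, and the impacts of the proactive fault tolerance mechanism, node failures, node failure repairs, disk failures and disk failure repairs on storage system reliability are evaluated quantitatively. Experimental results show that when the accuracy of the prediction model reaches 50%, system reliability can be improved by 1 to 3 times, and that 3-way replication systems are more sensitive to system parameters than 2-way replication systems. The proposed models can help system administrators compare and weigh system reliability under different fault tolerance schemes and system parameters, and thus build storage systems with high reliability and high availability.

Key words (Chinese version, translated): proactive fault tolerance, replication-based storage system, reliability analysis, node failure, disk failure, Weibull distribution, system state transition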

CLC Number: