Reliability analysis models for replication-based storage systems with proactive fault tolerance

doi:10.11772/j.issn.1001-9081.2020071067

Journal of Computer Applications, 2021, Vol. 41, Issue (4): 1113-1121. DOI: 10.11772/j.issn.1001-9081.2020071067

Special topic: Data Science and Technology

LI Jing¹, LUO Jinfei², LI Bingchao¹

1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China;
2. College of Computer Science, Nankai University, Tianjin 300350, China

Received: 2020-07-21 Revised: 2020-09-15 Online: 2020-10-20 Published: 2021-04-10
Corresponding author: LI Jing
About the authors: LI Jing (born 1982), female, from Dezhou, Shandong; lecturer, Ph.D.; research interests: large-scale data storage and machine learning. LUO Jinfei (born 1995), male, from Zhoukou, Henan; M.S. candidate; research interests: distributed storage systems and erasure-coded storage systems. LI Bingchao (born 1987), male, from Cangzhou, Hebei; lecturer, Ph.D.; research interest: computer architecture.
Supported by:
This work is partially supported by the Youth Program of the National Natural Science Foundation of China (61702521) and the Fundamental Research Funds for the Central Universities (3122019122).

Abstract

Abstract: A proactive fault tolerance mechanism predicts impending disk failures and prompts the system to migrate and back up at-risk data in advance, which can significantly improve storage system reliability. Because existing research cannot accurately evaluate the reliability of replication-based storage systems with proactive fault tolerance, several state transition models were proposed for such systems. The models were then implemented with Monte Carlo simulation to simulate the operation of replication-based storage systems with proactive fault tolerance, and the expected number of data-loss events in a system over a given operating period was counted. The Weibull distribution function was used to model the time distributions of device failure and failure repair events, and the impacts of the proactive fault tolerance mechanism, node failures, node failure repairs, disk failures and disk failure repairs on system reliability were evaluated quantitatively. Experimental results showed that when the accuracy of the prediction model reached 50%, system reliability was improved by 1 to 3 times, and that 3-way replication systems were more sensitive to system parameters than 2-way replication systems. With the proposed models, system administrators can compare and weigh system reliability under different fault tolerance schemes and system parameters, and thereby build storage systems with high reliability and high availability.

Key words: proactive fault tolerance, replication-based storage system, reliability analysis, node failure, disk failure, Weibull distribution, system state transition
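
To make the simulation idea in the abstract concrete, the following is a minimal Python sketch. It is not the authors' state transition models: it assumes a single group of replicas, ignores node failures, and treats a correctly predicted disk failure as a seamless replacement with no window of vulnerability. The function names, the `detect_rate` parameter (standing in for the prediction model's accuracy) and all default Weibull parameters are illustrative assumptions, not values from the paper.

```python
import random


def simulate_once(rng, replicas, mission_hours, fail, repair, detect_rate):
    """Count data-loss events for one replica group over one mission.

    fail and repair are (scale, shape) pairs passed to rng.weibullvariate.
    All modelling choices here are simplifying assumptions for illustration.
    """
    def time_to_failure():
        return rng.weibullvariate(fail[0], fail[1])

    def time_to_repair():
        return rng.weibullvariate(repair[0], repair[1])

    up = [True] * replicas                       # True while a replica is healthy
    next_event = [time_to_failure() for _ in range(replicas)]
    losses = 0
    while True:
        # Pick the replica whose next event (failure or repair) comes first.
        i = min(range(replicas), key=next_event.__getitem__)
        now = next_event[i]
        if now > mission_hours:
            return losses
        if not up[i]:
            # Repair finished: the replica is treated as a new disk.
            up[i] = True
            next_event[i] = now + time_to_failure()
        elif rng.random() < detect_rate:
            # Failure was predicted: data migrated in advance, disk swapped.
            next_event[i] = now + time_to_failure()
        else:
            # Unpredicted failure: the replica is unavailable until repaired.
            up[i] = False
            if any(up):
                next_event[i] = now + time_to_repair()
            else:
                # Every replica is down at once: one data-loss event.
                losses += 1
                up = [True] * replicas           # assume a full rebuild
                next_event = [now + time_to_failure() for _ in range(replicas)]


def expected_data_losses(replicas=2, mission_hours=100_000.0,
                         fail=(1000.0, 1.2), repair=(500.0, 2.0),
                         detect_rate=0.0, trials=200, seed=1):
    """Average the data-loss count over many independent simulated missions."""
    rng = random.Random(seed)
    total = sum(
        simulate_once(rng, replicas, mission_hours, fail, repair, detect_rate)
        for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    for rate in (0.0, 0.5, 0.9):
        print(rate, expected_data_losses(detect_rate=rate))
```

The default failure and repair scales are deliberately pessimistic so that data-loss events occur often enough to be visible in a short run; a realistic study would substitute measured Weibull parameters and a far larger number of trials.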

CLC Number: TP302

LI Jing, LUO Jinfei, LI Bingchao. Reliability analysis models for replication-based storage systems with proactive fault tolerance[J]. Journal of Computer Applications, 2021, 41(4): 1113-1121.
