Log analysis and workload characteristic extraction in distributed storage system

doi:10.11772/j.issn.1001-9081.2020010121

Journal of Computer Applications, 2020, Vol. 40, Issue 9: 2586-2593. DOI: 10.11772/j.issn.1001-9081.2020010121

Data science and technology

GOU Zi'an¹, ZHANG Xiao¹,², WU Dongnan¹, WANG Yanqiu¹

1. School of Computer Science, Northwestern Polytechnical University, Xi'an, Shaanxi 710129, China;
2. Key Laboratory of Big Data Storage and Management, Ministry of Industry and Information Technology (Northwestern Polytechnical University), Xi'an, Shaanxi 710129, China

Received: 2020-02-13 Revised: 2020-05-05 Online: 2020-05-10 Published: 2020-09-10
Supported by:
This work is partially supported by the National Key Research and Development Program of China (2018YFB1004400) and the Beijing Natural Science Foundation-Haidian Original Innovation Union Foundation (L192027).

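Corresponding author: ZHANG Xiao
About the authors: GOU Zi'an (born 1996), male, from Xi'an, Shaanxi, M.S. candidate; research interests: distributed file system performance optimization, cache optimization, file prefetching, and machine learning. ZHANG Xiao (born 1978), male, from Nanyang, Henan, associate professor, Ph.D., CCF senior member; research interests: computer storage systems, cloud computing, and big data storage and management. WU Dongnan (born 1996), male, from Baoji, Shaanxi, M.S. candidate; research interests: distributed file system performance optimization. WANG Yanqiu (born 1995), female, from Nanyang, Henan, M.S. candidate; research interests: distributed file system performance optimization and big data storage and management.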

Abstract

Analyzing the workloads that run on a file system helps optimize the performance of distributed file systems and is crucial to building new storage systems. Because workloads are becoming more complex and more diverse in scale, intuition-based analysis cannot fully capture the characteristics of workload traces. To address this problem, a distributed log analysis and workload characteristic extraction model is proposed. First, read- and write-related information is extracted from distributed file system logs by keyword. Second, the workload characteristics are described from two aspects: statistics and timing. Finally, the possibility of optimizing the system based on these workload characteristics is analyzed. Experimental results show that the proposed model is feasible and accurate to a certain degree, and that it reports workload statistics and timing characteristics in detail. It has low overhead, is timely and easy to analyze, and can be used to guide the synthesis of workloads with the same characteristics, hot-spot data monitoring, and cache prefetching optimization of the system.

Key words: distributed file system, log analysis, workload characteristic, timing characteristic, performance optimization
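
The first two steps named in the abstract (keyword-based extraction, then statistical and timing description) can be illustrated with a minimal Python sketch. This is not the authors' model: the abstract does not specify the log format or the exact features, so the log layout, the READ/WRITE keywords, the field names, and the functions `extract_records`, `statistical_features`, and `timing_features` below are all assumptions made for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical log layout (an assumption, not the format used in the paper):
#   <ISO timestamp> <operation keyword> <file path> <bytes>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<op>READ|WRITE)\s+(?P<path>\S+)\s+(?P<size>\d+)$"
)

# Hypothetical sample log; the HEARTBEAT line is not a read/write entry.
SAMPLE = [
    "2020-01-05T12:00:00 READ /data/a 4096",
    "2020-01-05T12:00:10 WRITE /data/b 8192",
    "2020-01-05T12:00:30 READ /data/a 4096",
    "2020-01-05T12:01:05 HEARTBEAT node-3 0",
    "2020-01-05T12:01:10 READ /data/c 1024",
]


def extract_records(lines):
    """Step 1: keep only lines that match the read/write keywords."""
    records = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip entries unrelated to reads or writes
        records.append({
            "ts": datetime.fromisoformat(m["ts"]),
            "op": m["op"],
            "path": m["path"],
            "size": int(m["size"]),
        })
    return records


def statistical_features(records):
    """Step 2a: simple statistics (read ratio, bytes per op, hottest file)."""
    ops = Counter(r["op"] for r in records)
    popularity = Counter(r["path"] for r in records)
    bytes_by_op = Counter()
    for r in records:
        bytes_by_op[r["op"]] += r["size"]
    total = sum(ops.values())
    return {
        "read_ratio": ops["READ"] / total if total else 0.0,
        "bytes_by_op": dict(bytes_by_op),
        "hottest": popularity.most_common(1),
    }


def timing_features(records, bucket_seconds=60):
    """Step 2b: requests per time bucket and inter-arrival gaps (seconds)."""
    ordered = sorted(records, key=lambda r: r["ts"])
    if not ordered:
        return {"per_bucket": {}, "gaps": []}
    start = ordered[0]["ts"]
    per_bucket = Counter(
        int((r["ts"] - start).total_seconds() // bucket_seconds)
        for r in ordered
    )
    gaps = [
        (b["ts"] - a["ts"]).total_seconds()
        for a, b in zip(ordered, ordered[1:])
    ]
    return {"per_bucket": dict(per_bucket), "gaps": gaps}


if __name__ == "__main__":
    recs = extract_records(SAMPLE)
    print(statistical_features(recs))
    print(timing_features(recs))
```

In this sketch, per-file access counts stand in for hot-spot detection and inter-arrival gaps stand in for the timing view; a real deployment would substitute the keywords and fields of the actual file system logs.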

CLC Number: TP311

GOU Zi'an, ZHANG Xiao, WU Dongnan, WANG Yanqiu. Log analysis and workload characteristic extraction in distributed storage system[J]. Journal of Computer Applications, 2020, 40(9): 2586-2593.
