Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (9): 2586-2593. DOI: 10.11772/j.issn.1001-9081.2020010121

• Data science and technology •

Log analysis and workload characteristic extraction in distributed storage system

GOU Zi'an1, ZHANG Xiao1,2, WU Dongnan1, WANG Yanqiu1   

  1. School of Computer Science, Northwestern Polytechnical University, Xi'an Shaanxi 710129, China;
  2. Key Laboratory of Big Data Storage and Management, Ministry of Industry and Information Technology (Northwestern Polytechnical University), Xi'an Shaanxi 710129, China
  • Received: 2020-02-13 Revised: 2020-05-05 Online: 2020-09-10 Published: 2020-05-10
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2018YFB1004400) and the Beijing Natural Science Foundation-Haidian Original Innovation Union Foundation (L192027).


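  • Corresponding author: ZHANG Xiao
  • About the authors: GOU Zi'an (born 1996), male, from Xi'an, Shaanxi, M.S. candidate; main research interests: distributed file system performance optimization, cache optimization, file prefetching, machine learning. ZHANG Xiao (born 1978), male, from Nanyang, Henan, associate professor, Ph.D., senior member of CCF; main research interests: computer storage systems, cloud computing, big data storage and management. WU Dongnan (born 1996), male, from Baoji, Shaanxi, M.S. candidate; main research interests: distributed file system performance optimization. WANG Yanqiu (born 1995), female, from Nanyang, Henan, M.S. candidate; main research interests: distributed file system performance optimization, big data storage and management.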

Abstract: Analyzing the workloads that run on a file system helps optimize the performance of distributed file systems and is crucial to the construction of new storage systems. Because workloads are becoming more complex and more diverse in scale, intuition-based analysis cannot fully capture the characteristics of workload traces. To solve this problem, a distributed log analysis and workload characteristic extraction model was proposed. First, read- and write-related information was extracted from distributed file system logs by keyword. Second, the workload characteristics were described from two aspects: statistics and timing. Finally, the possibility of optimizing the system based on these workload characteristics was analyzed. Experimental results show that the proposed model is feasible and accurate to a certain degree and can describe the statistical and timing characteristics of a workload in detail. It has the advantages of low overhead, high timeliness, and ease of analysis, and can be used to guide the synthesis of workloads with the same characteristics, hot spot data monitoring, and cache prefetching optimization.

Key words: distributed file system, log analysis, workload characteristic, timing characteristic, performance optimization
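
The three steps named in the abstract (keyword-based extraction, then statistical and timing description) can be illustrated with a minimal Python sketch. It is not the authors' model: the log line format, the field names `op`, `path`, and `bytes`, and the chosen features (read ratio, mean request size, hottest files, peak request rate, mean inter-arrival time) are all assumptions made for illustration, since the abstract specifies neither the file system nor its log layout.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical log line format (an assumption, not taken from the paper):
#   2020-01-15 10:00:01 INFO op=READ path=/data/a.bin bytes=4096
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+ "
    r"op=(?P<op>READ|WRITE) path=(?P<path>\S+) bytes=(?P<size>\d+)$"
)


def extract_records(lines):
    """Step 1: keep only lines carrying the READ/WRITE keywords and parse them."""
    records = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a read/write entry, so it is ignored
        records.append({
            "ts": datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"),
            "op": match.group("op"),
            "path": match.group("path"),
            "size": int(match.group("size")),
        })
    return records


def statistical_features(records):
    """Step 2a: simple aggregate statistics over the extracted requests."""
    total = len(records)
    ops = Counter(r["op"] for r in records)
    return {
        "read_ratio": ops["READ"] / total if total else 0.0,
        "mean_request_bytes": sum(r["size"] for r in records) / total if total else 0.0,
        "hottest_files": Counter(r["path"] for r in records).most_common(3),
    }


def timing_features(records):
    """Step 2b: simple timing statistics over the extracted requests."""
    ordered = sorted(records, key=lambda r: r["ts"])
    per_second = Counter(r["ts"] for r in ordered)
    gaps = [
        (later["ts"] - earlier["ts"]).total_seconds()
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return {
        "peak_requests_per_second": max(per_second.values(), default=0),
        "mean_interarrival_seconds": sum(gaps) / len(gaps) if gaps else 0.0,
    }


# Made-up sample input in the hypothetical format above.
SAMPLE_LOG = [
    "2020-01-15 10:00:01 INFO op=READ path=/data/a.bin bytes=4096",
    "2020-01-15 10:00:01 INFO op=READ path=/data/a.bin bytes=4096",
    "2020-01-15 10:00:03 INFO op=WRITE path=/data/b.bin bytes=8192",
    "2020-01-15 10:00:04 WARN heartbeat timeout from node-3",
    "2020-01-15 10:00:07 INFO op=READ path=/data/c.bin bytes=4096",
]

if __name__ == "__main__":
    sample_records = extract_records(SAMPLE_LOG)
    print(statistical_features(sample_records))
    print(timing_features(sample_records))
```

In a sketch like this, the hottest-file ranking corresponds to the hot spot data monitoring use mentioned in the abstract, while the request-rate and inter-arrival figures are the kind of timing information that workload synthesis or cache prefetching might draw on.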

