Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (3): 618-622.DOI: 10.11772/j.issn.1001-9081.2020122053

Special Issue: The 37th CCF National Database Conference (NDBC 2020)

Two-stage file compaction framework by log-structured merge-tree for time series data

ZHANG Lingzhe, HUANG Xiangdong, QIAO Jialin, GOU Wangminhao, WANG Jianmin   

  1. School of Software, Tsinghua University, Beijing 100085, China
  • Received: 2020-09-07  Revised: 2021-01-11  Online: 2021-03-10  Published: 2021-01-29
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2019YFB1707001) and the 2020 High-quality Development Special Project of the Ministry of Industry and Information Technology.


  • Corresponding author: HUANG Xiangdong
  • About the authors: ZHANG Lingzhe, born in 1998 in Suzhou, Jiangsu, is an M. S. candidate. His research interests include big data system management and distributed databases. HUANG Xiangdong, born in 1989 in Zhengzhou, Henan, Ph. D., is an assistant researcher. His research interests include big data system management and modeling. QIAO Jialin, born in 1993 in Baoding, Hebei, is a Ph. D. candidate. His research interests include big data system management. GOU Wangminhao, born in 1996 in Shiyan, Hubei, is an M. S. candidate. His research interests include big data storage systems and time series data management. WANG Jianmin, born in 1968 in Panshi, Jilin, Ph. D., is a professor. His research interests include big data, knowledge engineering, and software engineering.

Abstract: When the Log-Structured Merge-tree (LSM-tree) in a time series database operates under high write load or resource constraints, untimely file compaction causes data to accumulate in the C0 layer of the LSM-tree, increasing the latency of ad hoc queries over recently written data. To address this problem, a two-stage LSM compaction framework was proposed that achieves low-latency queries over newly written time series data while maintaining efficient queries over large blocks of data. Firstly, the file compaction process was divided into two stages: quick merging of a small number of out-of-order files, and merging of a large number of small files. Then, multiple file compaction strategies were provided within each stage. Finally, compaction resources were allocated between the two stages according to the query load of the system. Both the traditional LSM compaction strategy and the two-stage LSM compaction framework were implemented and tested on the time series database Apache IoTDB. The results show that, compared with the traditional LSM approach, the two-stage file compaction module greatly reduces the number of disk reads per ad hoc query while improving the flexibility of the compaction strategy, and improves the performance of historical data analysis queries by about 20%. Experimental results show that the two-stage LSM compaction framework can increase the ad hoc query efficiency of recently written data, improve the performance of historical data analysis queries, and enhance the flexibility of the compaction strategy.
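The paper gives no code for this workflow; the two-stage idea described in the abstract — separating quick merges of out-of-order files from bulk merges of small files, and dividing compaction resources by query load — might be sketched as follows. All class names, fields, and thresholds here are illustrative assumptions, not the Apache IoTDB API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical sketch of the two-stage compaction framework from the
# abstract; names and thresholds are illustrative, not the IoTDB API.

@dataclass
class DataFile:
    size: int            # file size in bytes
    out_of_order: bool   # holds points older than the sequential frontier

@dataclass
class TwoStageCompactor:
    small_file_limit: int = 1 << 20          # files below this count as "small"
    files: List[DataFile] = field(default_factory=list)

    def stage1_candidates(self) -> List[DataFile]:
        # Stage 1: quickly merge the few out-of-order files so that ad hoc
        # queries on recently written data touch fewer files on disk.
        return [f for f in self.files if f.out_of_order]

    def stage2_candidates(self) -> List[DataFile]:
        # Stage 2: merge the many small sequential files into large blocks,
        # keeping historical data analysis queries efficient.
        return [f for f in self.files
                if not f.out_of_order and f.size < self.small_file_limit]

    def allocate_budget(self, total_threads: int,
                        recent_query_ratio: float) -> Tuple[int, int]:
        # Split compaction resources by query load: the larger the share of
        # queries hitting recent data, the more threads stage 1 receives.
        # Assumes total_threads >= 2 so each stage keeps at least one thread.
        s1 = min(total_threads - 1,
                 max(1, round(total_threads * recent_query_ratio)))
        return s1, total_threads - s1
```

For example, with four compaction threads and 75% of queries hitting recent data, `allocate_budget(4, 0.75)` gives stage 1 three threads and stage 2 one, mirroring the abstract's load-driven resource allocation.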

Key words: Internet of Things (IoT), time series database, time series data, file compaction, Log-Structured Merge-tree (LSM), ad hoc query


