Journal of Computer Applications, 2023, Vol. 43, Issue (3): 759-766. DOI: 10.11772/j.issn.1001-9081.2022020211

Special Issue: Data science and technology

• Data science and technology •

Performance optimization strategy of distributed storage for industrial time series big data based on HBase

Li YANG, Jianting CHEN, Yang XIANG

  1. College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China
  • Received: 2022-02-24 Revised: 2022-05-31 Accepted: 2022-06-02 Online: 2022-08-16 Published: 2023-03-10
  • Contact: Yang XIANG
  • About author: YANG Li, born in 1998, M. S. candidate. His research interests include big data and distributed systems.
    CHEN Jianting, born in 1996, Ph. D. candidate. His research interests include industrial big data, intelligent manufacturing, deep learning.
  • Supported by:
    National Key Research and Development Program of China (2019YFB1704402)




In automated industrial scenarios, the volume of time series log data generated by large numbers of industrial devices has grown explosively, and business demand for access to these data has risen with it. HBase, a distributed column-family database, can store industrial time series big data, but existing storage strategies do not meet the specific access requirements of such data well, because they ignore the correlation between the data and the access behavior characteristics of specific business scenarios. To address this problem, a distributed storage performance optimization strategy for massive industrial time series data was proposed, built on the distributed storage system HBase and exploiting the correlation between data and access behavior characteristics in industrial scenarios. For the load skew (tilt) caused by the characteristics of industrial time series data, a load balancing optimization strategy based on hot and cold data partitioning and access behavior classification was proposed: the data were classified as hot or cold by a Logistic Regression (LR) model, and the hot data were distributed across different nodes. In addition, to further reduce cross-node communication overhead in the storage cluster and improve the query efficiency of the high-dimensional index of industrial time series data, a strategy of placing the index and its main data in the same Region was proposed: by designing the index RowKey fields and the rules for concatenating them, each index entry was stored in the same Region as its corresponding main data.
Experimental results on real industrial time series data show that, after the optimization strategy is introduced, the degree of skew (tilt) in the data load distribution is reduced by 28.5% and query efficiency is improved by 27.7%. This indicates that the proposed strategy can effectively mine access patterns for specific time series data, distribute load reasonably, reduce data access overhead, and meet the access requirements of specific time series big data.
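
The two ideas in the abstract, spreading hot data across nodes and keeping each index entry in the same Region as its main data, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The LR weights are hand-picked rather than trained, and the features (`reads_last_hour`, `age_hours`), the bucket count, the `|d|` / `|i|` markers and all function names are hypothetical. The sketch also assumes the HBase table is pre-split so that each two-digit salt prefix maps to one Region.

```python
import hashlib
import math

NUM_BUCKETS = 4  # assumed: one salt bucket per pre-split Region

# Assumed LR parameters, hand-picked for illustration (not trained, not from the paper).
BIAS = -1.0
W_READS = 0.08   # weight for the number of reads in the last hour
W_AGE = -0.05    # weight for the record age in hours


def hot_probability(reads_last_hour: int, age_hours: float) -> float:
    """Logistic-regression score: estimated probability that a record is hot."""
    z = BIAS + W_READS * reads_last_hour + W_AGE * age_hours
    return 1.0 / (1.0 + math.exp(-z))


def is_hot(reads_last_hour: int, age_hours: float, threshold: float = 0.5) -> bool:
    """Classify a record as hot when its LR score reaches the threshold."""
    return hot_probability(reads_last_hour, age_hours) >= threshold


def _bucket(text: str) -> int:
    """Map a string to a stable bucket number in [0, NUM_BUCKETS)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


def salt_for(device_id: str, timestamp_ms: int, hot: bool) -> int:
    """Hot rows are spread per (device, timestamp); cold rows stay together per device."""
    if hot:
        return _bucket(f"{device_id}:{timestamp_ms}")
    return _bucket(device_id)


def main_rowkey(salt: int, device_id: str, timestamp_ms: int) -> str:
    """RowKey of a main data row: salt prefix, data marker, device, timestamp."""
    return f"{salt:02d}|d|{device_id}|{timestamp_ms:013d}"


def index_rowkey(salt: int, field: str, value: str,
                 device_id: str, timestamp_ms: int) -> str:
    """RowKey of an index entry; it reuses the main row's salt prefix so that
    both keys sort into the same key range (and, under the pre-split
    assumption, the same Region)."""
    return f"{salt:02d}|i|{field}={value}|{device_id}|{timestamp_ms:013d}"
```

Under these assumptions, an index lookup and the follow-up read of the main row share the same salt prefix, so both requests would be served by one Region rather than crossing nodes. A reader that does not know whether a row was written as hot or cold would have to check both candidate salts, a trade-off this sketch does not address.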

Key words: distributed storage, time series big data, industrial big data, load balancing, HBase



