Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (3): 759-766. DOI: 10.11772/j.issn.1001-9081.2022020211

• Data science and technology •

Performance optimization strategy of distributed storage for industrial time series big data based on HBase

Li YANG, Jianting CHEN, Yang XIANG

  1. College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China
  • Received: 2022-02-24 Revised: 2022-05-31 Accepted: 2022-06-02 Online: 2022-08-16 Published: 2023-03-10
  • Contact: Yang XIANG
  • About author: YANG Li, born in 1998 in Zhangye, Gansu, M. S. candidate. His research interests include big data and distributed systems.
    CHEN Jianting, born in 1996 in Jilin, Jilin, Ph. D. candidate, CCF student member. His research interests include industrial big data, intelligent manufacturing, and deep learning.
    XIANG Yang, born in 1962 in Shanghai, Ph. D., professor, CCF senior member. His research interests include natural language processing, data mining, and knowledge graphs.
  • Supported by:
    National Key Research and Development Program of China (2019YFB1704402)

Abstract:

In automated industrial scenarios, the volume of time series log data generated by large numbers of industrial devices is growing explosively, and business applications place increasing demands on access to this data. HBase, a distributed column-family database, can store industrial time series big data, but existing strategies do not consider the correlation between data characteristics and access behavior in specific business scenarios, so they cannot meet the specific access requirements of industrial time series data well. To address this problem, a distributed storage performance optimization strategy for massive industrial time series data was proposed, built on HBase and exploiting the correlation between data and access behavior characteristics in industrial scenarios. For the load skew caused by the characteristics of industrial time series data, a load balancing optimization strategy based on hot and cold data partitioning and access behavior classification was proposed: a Logistic Regression (LR) model was used to classify data as hot or cold, and the hot data were spread across different nodes. In addition, to further reduce cross-node communication overhead in the storage cluster and improve the query efficiency of the high-dimensional index over industrial time series data, a strategy of placing the index and the main data in the same Region was proposed: the index RowKey fields and their concatenation rules were designed so that each index entry is stored in the same Region as the main data it refers to. Experimental results on real industrial time series data show that, after the optimization strategy is introduced, the skew of the data load distribution is reduced by 28.5% and query efficiency is improved by 27.7%, demonstrating that the proposed strategy can effectively mine the access patterns of specific time series data, distribute load reasonably, reduce data access overhead, and meet the access requirements of specific time series big data.

Key words: distributed storage, time series big data, industrial big data, load balancing, HBase
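
The idea of keeping an index entry in the same Region as its main data can be illustrated with a minimal Python sketch. The abstract does not give the actual RowKey fields or concatenation rules, so everything here is an assumption made for illustration: the bucket prefix, the `|d|` and `|i|` markers, the CRC32 hash, and the names `bucket_of`, `main_rowkey`, `index_rowkey` and `region_of` are hypothetical, and Regions are modeled simply as key ranges pre-split on fixed-width bucket prefixes.

```python
import bisect
import zlib

# Hypothetical layout: the table is pre-split into NUM_BUCKETS Regions whose
# start keys are the fixed-width bucket prefixes "00", "01", ...
NUM_BUCKETS = 4
REGION_START_KEYS = [f"{b:02d}".encode() for b in range(NUM_BUCKETS)]


def bucket_of(device_id: str) -> int:
    """Map a device to a bucket with a stable hash (CRC32 is an assumed choice)."""
    return zlib.crc32(device_id.encode()) % NUM_BUCKETS


def main_rowkey(device_id: str, ts_ms: int) -> bytes:
    """Main-data RowKey: bucket | 'd' | device | zero-padded timestamp."""
    return f"{bucket_of(device_id):02d}|d|{device_id}|{ts_ms:013d}".encode()


def index_rowkey(device_id: str, ts_ms: int, field: str, value: str) -> bytes:
    """Index RowKey: same bucket prefix as the main row, then the indexed
    field and value, then the parts needed to locate the main row."""
    return (f"{bucket_of(device_id):02d}|i|{field}={value}|"
            f"{device_id}|{ts_ms:013d}").encode()


def region_of(rowkey: bytes) -> int:
    """Position of the Region whose [start key, next start key) range holds rowkey."""
    return bisect.bisect_right(REGION_START_KEYS, rowkey) - 1


if __name__ == "__main__":
    m = main_rowkey("pump-17", 1650000000000)
    i = index_rowkey("pump-17", 1650000000000, "status", "alarm")
    # Both keys start with the same bucket prefix, so they map to one Region.
    print(m, region_of(m))
    print(i, region_of(i))
```

Because the index key reuses the bucket prefix of the main key, an index hit and the row it points to fall in the same key range, so resolving the hit does not need a second node. The cost of this assumed layout is that a lookup by indexed value has to scan the index prefix in every bucket.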

CLC number: