Performance optimization strategy of distributed storage for industrial time series big data based on HBase

Journal of Computer Applications, 2023, Vol. 43, Issue (3): 759-766. DOI: 10.11772/j.issn.1001-9081.2022020211

Section: Data science and technology

Li YANG, Jianting CHEN, Yang XIANG

College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China

Received: 2022-02-24 Revised: 2022-05-31 Accepted: 2022-06-02 Online: 2022-08-16 Published: 2023-03-10
Contact: Yang XIANG
About author: YANG Li, born in 1998 in Zhangye, Gansu, M. S. candidate. His research interests include big data and distributed systems.
CHEN Jianting, born in 1996 in Jilin, Jilin, Ph. D. candidate, CCF student member. His research interests include industrial big data, intelligent manufacturing, and deep learning.
XIANG Yang, born in 1962 in Shanghai, Ph. D., professor, CCF senior member. His research interests include natural language processing, data mining, and knowledge graphs.
Supported by:
National Key Research and Development Program of China (2019YFB1704402)

Abstract

In automated industrial scenarios, the volume of time series log data generated by large numbers of industrial devices is growing explosively, and business scenarios place increasing demands on access to that data. Although HBase, a distributed column-family database, can store industrial time series big data, existing strategies do not consider the correlation between data characteristics and access behavior in specific business scenarios, and so cannot meet the specific access requirements of industrial time series data well. To address this problem, a distributed storage performance optimization strategy for massive industrial time series data is proposed on top of the distributed storage system HBase, exploiting the correlation between data and access behavior characteristics in industrial scenarios. For the load skew (load tilt) caused by the characteristics of industrial time series data, a load balancing optimization strategy based on hot/cold data partitioning and access behavior classification is proposed: a Logistic Regression (LR) model classifies data as hot or cold, and hot data are spread across different nodes. In addition, to further reduce cross-node communication overhead in the storage cluster and improve the query efficiency of the high-dimensional index of industrial time series data, a strategy of placing the index in the same Region as its main data is proposed: the index RowKey fields and concatenation rules are designed so that each index entry is stored in the same Region as the main data it refers to. Experimental results on real industrial time series data show that, after introducing the optimization strategy, the skew of the data load distribution is reduced by 28.5% and query efficiency is improved by 27.7%. This demonstrates that the proposed strategy can effectively mine the access patterns of specific time series data, distribute load reasonably, reduce data access overhead, and meet the access requirements of specific time series big data.

Key words: distributed storage, time series big data, industrial big data, load balancing, HBase
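The hot/cold classification and hot-data spreading described in the abstract can be illustrated with a minimal Python sketch. The source does not list the features, training data, or placement rule the authors used, so everything below is an assumption made for illustration: the two features (normalized access frequency and recency), the toy training set, the hand-written gradient-descent logistic regression, and the helper `salted_row_key`, which spreads hot rows by prefixing a hash bucket (a common HBase salting technique, not necessarily the paper's mechanism).

```python
import math
import zlib

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(samples, labels, lr=0.5, epochs=2000):
    """Fit weights and bias with plain batch gradient descent."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias

def is_hot(features, weights, bias, threshold=0.5) -> bool:
    """Classify a record as hot when the predicted probability reaches the threshold."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score) >= threshold

def salted_row_key(row_key: str, hot: bool, num_buckets: int = 4) -> str:
    """Hot rows get a hash-derived bucket prefix; cold rows keep their key."""
    if not hot:
        return row_key
    bucket = zlib.crc32(row_key.encode("utf-8")) % num_buckets
    return f"{bucket:02d}|{row_key}"

# Hypothetical features: [normalized access frequency, normalized recency].
SAMPLES = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.95], [0.1, 0.2], [0.2, 0.1], [0.05, 0.3]]
LABELS = [1, 1, 1, 0, 0, 0]  # 1 = hot, 0 = cold
WEIGHTS, BIAS = train_logistic_regression(SAMPLES, LABELS)

if __name__ == "__main__":
    hot = is_hot([0.85, 0.9], WEIGHTS, BIAS)
    print(hot, salted_row_key("dev42|20220301", hot))
```

With a table pre-split on the bucket prefixes, hot rows would land in different Regions while cold rows keep their original time-ordered keys. The bucket count of 4 is arbitrary.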

CLC number:

TP311

Li YANG, Jianting CHEN, Yang XIANG. Performance optimization strategy of distributed storage for industrial time series big data based on HBase[J]. Journal of Computer Applications, 2023, 43(3): 759-766.

Figures and tables (12)

Fig. 1 HBase cluster load skew (tilt)

Fig. 2 System architecture after introducing the optimization strategy

Fig. 3 Data write process under the optimization strategy

Fig. 4 Data read process under the optimization strategy

Fig. 5 RowKey field design of the secondary index

Fig. 6 Index query process under the optimization strategy
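The idea of storing each index entry in the same Region as its main data can be sketched as follows. The actual field layout of Fig. 5 is not reproduced in this listing, so this is not the authors' RowKey design. It shows one generic way to get co-location, assuming index and main rows share a table: prefix the index key with the start key of the Region that holds the main row, so that HBase's lexicographic ordering places both in the same key range. The names `INDEX_MARKER`, `region_start_for`, and `index_row_key` and the sample keys are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical marker. The leading 0x00 byte sorts before any other byte, which
# keeps the index key below the Region's end key unless that end key is itself
# the start key followed by 0x00 (an edge case this sketch does not handle).
INDEX_MARKER = b"\x00idx\x00"

def region_start_for(row_key: bytes, region_starts: Sequence[bytes]) -> bytes:
    """Return the start key of the Region whose range contains row_key.

    region_starts must be sorted and begin with b"" (the first Region).
    """
    return region_starts[bisect_right(region_starts, row_key) - 1]

def index_row_key(main_key: bytes, indexed_value: bytes,
                  region_starts: Sequence[bytes]) -> bytes:
    """Build an index key that sorts into the same Region as main_key."""
    start = region_start_for(main_key, region_starts)
    return start + INDEX_MARKER + indexed_value + b"\x00" + main_key

# Assumed Region boundaries and keys, for illustration only.
REGION_STARTS = [b"", b"dev3", b"dev6"]
MAIN_KEY = b"dev4|20220301T120000"
INDEX_KEY = index_row_key(MAIN_KEY, b"temp=85", REGION_STARTS)

if __name__ == "__main__":
    print(INDEX_KEY)
    print(region_start_for(INDEX_KEY, REGION_STARTS) ==
          region_start_for(MAIN_KEY, REGION_STARTS))
```

Because the index entry carries the main RowKey as its suffix, a query could scan the index prefix inside one Region and then fetch the main rows from that same Region, which is the cross-node traffic the abstract aims to avoid.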

Tab. 1 Time series (TS) datasets

Dataset	Data scale	Data size/GB
TS1	4 551 131	0.53
TS2	9 102 282	1.05
TS3	27 307 256	3.15
TS4	45 511 729	5.25
TS5	63 716 420	7.35
TS6	91 023 458	10.49
TS7	136 535 187	15.75
TS8	182 046 916	20.44
TS9	240 000 000	26.95

Tab. 2 Load skew (tilt) of different methods under different data volumes

Dataset	Original system	PUB-HBase	HBalancer	Proposed method
TS1	1 320 336	1 093 551	1 079 617	970 510
TS2	1 228 332	1 148 735	1 080 976	968 935
TS3	959 272	964 214	962 724	828 265
TS4	784 302	708 937	672 987	622 490
TS5	580 265	514 919	486 567	424 084
TS6	120 129	100 345	92 936	82 163
TS7	121 554	92 912	93 268	75 974
TS8	104 722	77 992	73 147	63 139
TS9	91 217	68 624	64 869	55 609
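The 28.5% reduction in load skew quoted in the abstract can be checked against the table with a few lines of Python. The source does not say how the headline figure was aggregated. The sketch below assumes it is the mean of the per-dataset relative reductions against the original system, which does reproduce the reported value. The variable and function names are illustrative.

```python
# Load skew values transcribed from Tab. 2, in dataset order TS1..TS9.
ORIGINAL = [1320336, 1228332, 959272, 784302, 580265, 120129, 121554, 104722, 91217]
METHODS = {
    "PUB-HBase": [1093551, 1148735, 964214, 708937, 514919, 100345, 92912, 77992, 68624],
    "HBalancer": [1079617, 1080976, 962724, 672987, 486567, 92936, 93268, 73147, 64869],
    "Proposed": [970510, 968935, 828265, 622490, 424084, 82163, 75974, 63139, 55609],
}

def mean_relative_reduction(baseline, values):
    """Average over datasets of (baseline - value) / baseline."""
    reductions = [(b - v) / b for b, v in zip(baseline, values)]
    return sum(reductions) / len(reductions)

MEAN_REDUCTION_PCT = {
    name: 100.0 * mean_relative_reduction(ORIGINAL, values)
    for name, values in METHODS.items()
}

if __name__ == "__main__":
    for name, pct in MEAN_REDUCTION_PCT.items():
        print(f"{name}: {pct:.1f}%")  # the "Proposed" row prints 28.5%
```

Under the same aggregation the proposed method reduces skew more than HBalancer, which in turn reduces it more than PUB-HBase.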

Tab. 3 Training time and prediction accuracy under different training data volumes

Training set size/GB	Training time/s	Model accuracy/%
0.54	369.19	76.36
0.90	612.80	79.83
1.26	856.43	82.16
1.80	1 219.90	83.38
2.70	1 829.10	84.56
4.50	3 048.49	85.42
5.40	3 658.19	85.57
7.20	4 877.59	85.62
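A short Python sketch makes the trade-off in the table explicit: training time grows roughly linearly with the training set size, while the accuracy gained per additional gigabyte shrinks. This is only arithmetic on the tabulated values and makes no claim about the authors' own analysis. The names are illustrative.

```python
# (training set size in GB, training time in s, model accuracy in %) from Tab. 3.
ROWS = [
    (0.54, 369.19, 76.36),
    (0.90, 612.80, 79.83),
    (1.26, 856.43, 82.16),
    (1.80, 1219.90, 83.38),
    (2.70, 1829.10, 84.56),
    (4.50, 3048.49, 85.42),
    (5.40, 3658.19, 85.57),
    (7.20, 4877.59, 85.62),
]

def seconds_per_gb(rows):
    """Training cost per gigabyte for each row."""
    return [seconds / size for size, seconds, _ in rows]

def accuracy_gain_per_gb(rows):
    """Accuracy points gained per extra gigabyte between consecutive rows."""
    return [
        (nxt[2] - cur[2]) / (nxt[0] - cur[0])
        for cur, nxt in zip(rows, rows[1:])
    ]

COST = seconds_per_gb(ROWS)
GAIN = accuracy_gain_per_gb(ROWS)

if __name__ == "__main__":
    print([round(c, 1) for c in COST])
    print([round(g, 2) for g in GAIN])
```

The per-gigabyte cost stays between 677 and 684 s across all rows, while the marginal gain falls at every step, so larger training sets cost proportionally more time for steadily smaller accuracy improvements.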

Fig. 7 Comparison of data query time

Tab. 4 Comparison of comprehensive data query time (ms)

Dataset	Original system	PUB-HBase	Proposed method
TS1	212	194	180
TS2	408	360	339
TS3	739	646	602
TS4	1 322	1 038	954
TS5	6 536	5 062	4 685
TS6	8 549	7 082	6 114
TS7	12 059	11 012	8 184
TS8	21 715	17 944	13 948
TS9	26 234	19 345	14 166
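The 27.7% query efficiency improvement quoted in the abstract can be related to this table in the same hedged way. The sketch below assumes "efficiency improvement" means the fraction of query time saved relative to the original system, averaged over the nine datasets. The source does not state the aggregation, but this reading reproduces the figure. It also lists the per-dataset savings, which grow with data volume.

```python
# Query times in ms transcribed from Tab. 4, in dataset order TS1..TS9.
DATASETS = ["TS1", "TS2", "TS3", "TS4", "TS5", "TS6", "TS7", "TS8", "TS9"]
ORIGINAL_MS = [212, 408, 739, 1322, 6536, 8549, 12059, 21715, 26234]
PROPOSED_MS = [180, 339, 602, 954, 4685, 6114, 8184, 13948, 14166]

def time_savings(baseline, optimized):
    """Fraction of baseline query time saved on each dataset."""
    return [1.0 - o / b for b, o in zip(baseline, optimized)]

SAVINGS = time_savings(ORIGINAL_MS, PROPOSED_MS)
MEAN_SAVING_PCT = 100.0 * sum(SAVINGS) / len(SAVINGS)

if __name__ == "__main__":
    for name, saving in zip(DATASETS, SAVINGS):
        print(f"{name}: {100.0 * saving:.1f}%")
    print(f"mean: {MEAN_SAVING_PCT:.1f}%")  # mean: 27.7%
```

The saving rises monotonically from about 15.1% on TS1 to about 46.0% on TS9, so the benefit of the strategy is largest on the largest datasets in this table.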

Fig. 8 Relationship between hot data area hit rate and time required for data query

