Performance comparison and optimization of Learned Index and B-Tree on different data distribution

doi:10.11772/j.issn.1001-9081.2022091372

Journal of Computer Applications, 2023, Vol. 43, Issue (S1): 100-106. DOI: 10.11772/j.issn.1001-9081.2022091372

• Data Science and Technology •

Yiqi SHEN¹, Peng CAI¹, Songling LIU²

¹ School of Data Science & Engineering, East China Normal University, Shanghai 200062, China
² Huawei Technologies Company Limited, Shenzhen, Guangdong 518129, China

Received: 2022-08-25 Revised: 2022-09-19 Accepted: 2022-09-30 Online: 2023-07-04 Published: 2023-06-30
Corresponding author: Peng CAI
About the authors: Yiqi SHEN (born 2000), female, from Shanghai, master's student; main research interest: AI4DB.
Peng CAI (born 1978), male, from Taixing, Jiangsu, professor, Ph.D., CCF member; main research interests: transaction processing, AI4DB. Email: pcai@dase.ecnu.edu.cn
Songling LIU (born 1992), male, from Wuhan, Hubei, engineer, M.S.; main research interests: big data, distributed data engines.
Supported by:
National Natural Science Foundation of China (61972149); Project of the Ministry of Industry and Information Technology of China (TC210804V?1)

Abstract:

A Learned Index is an index that uses a trained model to map input keys to their storage positions. It learns the distribution of the data, and different data distributions shift the balance between model accuracy and model complexity. To explore the scenarios that suit Learned Index, the original Learned Index and its further-optimized variant, An Updatable Adaptive Learned Index (ALEX), were tested on data of different distributions and sizes and compared with B-Tree. The results show that, for large batches of data, Learned Index takes less time to build than B-Tree and has clear advantages in read performance and storage footprint, but its write performance is poor. Learned Index is therefore better suited to OnLine Analytical Processing (OLAP) databases in big data scenarios, where it serves the storage and querying of static data. Drawing on the index structure of B-Tree, the structure of the original Learned Index was then optimized and adjusted. On large batches of data, the optimized Learned Index reached up to 2 times the read performance and up to 3 times the write performance of the original Learned Index.

Key words: Learned Index, B-Tree, An Updatable Adaptive Learned Index (ALEX), OnLine Analytical Processing (OLAP) database, static data, optimization and adjustment
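
The idea described in the abstract, a model that predicts where a key is stored followed by a short local search, can be illustrated with a minimal sketch. This is not the authors' implementation and not ALEX: the class name `LinearLearnedIndex` is hypothetical, a single least-squares line stands in for the model, and the data sets are synthetic. It only shows how the data distribution changes the error bound, and hence the search window, of the model.

```python
import bisect
import random


class LinearLearnedIndex:
    """One linear model over a sorted key array plus a bounded local search.

    Illustrative only; assumes a non-empty list of numeric keys.
    """

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of position ~ slope * key + intercept.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(self.keys))
        self.slope = cov / var_k if var_k else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # The largest error on the stored keys bounds the search window.
        self.max_err = max(abs(p - self._predict(k)) for p, k in enumerate(self.keys))

    def _predict(self, key):
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return a position holding key, or None if the key is absent."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None


if __name__ == "__main__":
    random.seed(0)
    uniform = random.sample(range(1_000_000), 10_000)
    skewed = list({int(random.lognormvariate(0, 2) * 1_000) for _ in range(10_000)})
    for name, data in (("uniform", uniform), ("lognormal", skewed)):
        index = LinearLearnedIndex(data)
        assert all(index.keys[index.lookup(k)] == k for k in data)
        print(name, "search window:", 2 * index.max_err + 1)
```

A wider search window means the single model fits that distribution less well, which is the accuracy versus complexity trade-off the abstract refers to; practical learned indexes use several models rather than one line to keep the window small.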

CLC number: TP311

Cite this article: Yiqi SHEN, Peng CAI, Songling LIU. Performance comparison and optimization of Learned Index and B-Tree on different data distribution[J]. Journal of Computer Applications, 2023, 43(S1): 100-106.

Figures/Tables (18)

References (16)

1	ZHANG Z, JIN P Q, XIE X K. Learned indexes: current situations and research prospects [J]. Journal of Software, 2021, 32(4): 1129-1150. (in Chinese)
2	WANG X L. Survey of learned index algorithms [J]. Wireless Communication Technology, 2021, 30(1): 47-52. (in Chinese) DOI: 10.3969/j.issn.1003-8329.2021.01.010
3	KRASKA T, BEUTEL A, CHI E H, et al. The case for learned index structures [C]// Proceedings of the 2018 International Conference on Management of Data. New York: ACM, 2018: 489-504. DOI: 10.1145/3183713.3196909
4	BAYER R, MCCREIGHT E M. Organization and maintenance of large ordered indices [J]. Acta Informatica, 1970, 1: 173-189. DOI: 10.21236/ad0712079
5	DING J, MINHAS U F, YU J, et al. ALEX: an updatable adaptive learned index [C]// SIGMOD Conference 2020: Proceedings of the 2020 International Conference on Management of Data. New York: ACM, 2020: 969-984. DOI: 10.1145/3318464.3389711
6	LI P, HUA Y, ZUO P, et al. A scalable learned index scheme in storage systems [EB/OL]. (2019-05-08) [2022-07-15]. DOI: 10.14778/3489496.3489512
7	FERRAGINA P, VINCIGUERRA G. The PGM-index [J]. Proceedings of the VLDB Endowment, 2020, 13: 1162-1175. DOI: 10.14778/3389133.3389135
8	KIPF A, MARCUS R, VAN RENEN A, et al. SOSD: a benchmark for learned indexes [EB/OL]. (2019-11-29) [2022-07-11]. DOI: 10.14778/3421424.3421425
9	TANG C, WANG Y, DONG Z, et al. XIndex: a scalable learned index for multicore data storage [C]// Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. New York: ACM, 2020: 308-320. DOI: 10.1145/3332466.3374547
10	KIPF A, MARCUS R, VAN RENEN A, et al. RadixSpline: a single-pass learned index [C]// Proceedings of the 3rd International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. New York: ACM, 2020: Article No.5. DOI: 10.1145/3401071.3401659
11	HADIAN A, HEINIS T. Shift-table: a low-latency learned index for range queries using model correction [C]// Proceedings of the 24th International Conference on Extending Database Technology. [S.l.]: dblp, 2021: 253-264.
12	LIU G, KULIK L, MA X, et al. A lazy approach for efficient index learning [EB/OL]. (2021-02-16) [2022-07-17].
13	YANG J. Learned-indexes [EB/OL]. (2018-12-21) [2022-07-19].
14	MARTIN T. BTree [EB/OL]. (2012-07-22) [2022-07-19].
15	GUO C, LI K, WANG Y Y, et al. Performance analysis of main-memory database indexes on multi-core processors [J]. Chinese Journal of Computers, 2010, 33(8): 1512-1522. (in Chinese) DOI: 10.3724/SP.J.1016.2010.01512
16	MARCUS R, KIPF A, VAN RENEN A, et al. Benchmarking learned indexes [J]. Proceedings of the VLDB Endowment, 2020, 14(1): 1-13. DOI: 10.14778/3421424.3421425

Learned Index	Data distribution
Original Learned Index [3]	Lognormal
ALEX [5]	Lognormal, uniform
AIDEL [6]	Lognormal
PGM-index [7]	Lognormal, uniform, Zipf
SOSD benchmark [8]	Lognormal, normal, uniform
XIndex [9]	Lognormal, normal, linear
RadixSpline [10]	Lognormal
Shift-Table [11]	Lognormal, normal, uniform
Lazy Index Learning [12]	Uniform

Index	Data distribution	Operation	Average time per operation (s)
ALEX	Linear	Query	9.07E-09
B-Tree	Linear	Query	1.30E-07
ALEX	Linear	Insert	5.69E-08
B-Tree	Linear	Insert	2.11E-08
ALEX	Linear	Delete	2.98E-07
B-Tree	Linear	Delete	2.34E-07
ALEX	Exponential	Query	1.98E-08
B-Tree	Exponential	Query	1.24E-07
ALEX	Exponential	Insert	1.11E-07
B-Tree	Exponential	Insert	2.79E-08
ALEX	Exponential	Delete	2.98E-07
B-Tree	Exponential	Delete	2.38E-07
ALEX	wiki data	Query	2.67E-08
B-Tree	wiki data	Query	1.21E-07
ALEX	wiki data	Insert	1.93E-07
B-Tree	wiki data	Insert	2.80E-08
ALEX	wiki data	Delete	2.64E-07
B-Tree	wiki data	Delete	2.32E-07
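
To make the read versus write contrast in the table above easier to see, the per-operation times can be turned into B-Tree/ALEX ratios. The short script below is only a reading aid: the numbers are copied from the table, while the dictionary layout and the helper name `btree_over_alex` are assumptions made for this illustration.

```python
# Average time per operation in seconds, copied from the table above,
# stored as (ALEX, B-Tree) for each (distribution, operation) pair.
times = {
    ("linear", "query"): (9.07e-09, 1.30e-07),
    ("linear", "insert"): (5.69e-08, 2.11e-08),
    ("linear", "delete"): (2.98e-07, 2.34e-07),
    ("exponential", "query"): (1.98e-08, 1.24e-07),
    ("exponential", "insert"): (1.11e-07, 2.79e-08),
    ("exponential", "delete"): (2.98e-07, 2.38e-07),
    ("wiki", "query"): (2.67e-08, 1.21e-07),
    ("wiki", "insert"): (1.93e-07, 2.80e-08),
    ("wiki", "delete"): (2.64e-07, 2.32e-07),
}


def btree_over_alex(alex_s, btree_s):
    """Ratio above 1 means ALEX is faster than B-Tree for that operation."""
    return btree_s / alex_s


ratios = {key: btree_over_alex(*pair) for key, pair in times.items()}

for (dist, op), ratio in ratios.items():
    print(f"{dist:12s} {op:7s} B-Tree/ALEX = {ratio:.2f}")
```

In this table every query ratio is above 1 and every insert and delete ratio is below 1, which matches the abstract's statement that the learned index reads faster but writes slower than B-Tree.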

Index	Data distribution	Operation	Average time per operation (s)
ALEX	Linear	Query	3.69E-08
B-Tree	Linear	Query	4.12E-07
ALEX	Linear	Insert	2.75E-07
B-Tree	Linear	Insert	6.15E-08
ALEX	Linear	Delete	2.09E-04
B-Tree	Linear	Delete	3.14E-07
ALEX	Exponential	Query	2.34E-08
B-Tree	Exponential	Query	4.11E-07
ALEX	Exponential	Insert	3.32E-07
B-Tree	Exponential	Insert	6.17E-08
ALEX	Exponential	Delete	1.60E-04
B-Tree	Exponential	Delete	3.07E-07
ALEX	wiki data	Query	3.38E-08
B-Tree	wiki data	Query	1.21E-07
ALEX	wiki data	Insert	2.69E-07
B-Tree	wiki data	Insert	2.80E-08
ALEX	wiki data	Delete	6.20E-06
B-Tree	wiki data	Delete	2.32E-07
