《计算机应用》唯一官方网站 ›› 2023, Vol. 43 ›› Issue (S1): 100-106.DOI: 10.11772/j.issn.1001-9081.2022091372

• 数据科学与技术 • 上一篇    

Learned Index和B-Tree在不同分布数据上的性能对比及优化

沈怡琪1, 蔡鹏1(), 刘松灵2   

  1. 1.华东师范大学 数据科学与工程学院,上海 200062
    2.华为技术有限公司,广东 深圳 518129
  • 收稿日期:2022-08-25 修回日期:2022-09-19 接受日期:2022-09-30 发布日期:2023-07-04 出版日期:2023-06-30
  • 通讯作者: 蔡鹏
  • 作者简介:沈怡琪(2000—),女,上海人,硕士研究生,主要研究方向:AI4DB
    蔡鹏(1978—),男,江苏泰兴人,教授,博士,CCF会员,主要研究方向:事务处理、AI4DB.pcai@dase.ecnu.edu.cn
    刘松灵(1992—),男,湖北武汉人,工程师,硕士,主要研究方向:大数据、分布式数据引擎。
  • 基金资助:
    国家自然科学基金资助项目(61972149);中国工业和信息化部项目(TC210804V?1)

Performance comparison and optimization of Learned Index and B-Tree on different data distribution

Yiqi SHEN1, Peng CAI1(), Songling LIU2   

  1. 1.School of Data Science & Engineering,East China Normal University,Shanghai 200062,China
    2.Huawei Technologies Company Limited,Shenzhen Guangdong 518129,China
  • Received:2022-08-25 Revised:2022-09-19 Accepted:2022-09-30 Online:2023-07-04 Published:2023-06-30
  • Contact: Peng CAI

摘要:

Learned Index是一种通过训练模型来建立输入数据和存储位置之间映射关系的索引,它能学习到数据间分布的信息,而不同的数据分布将影响模型训练准确率和模型复杂度之间的平衡。为了探索Learned Index适用的场景,使用不同分布、不同数据量的数据对它和加以优化的可更新的自适应学习索引(ALEX)进行性能测试,并与B-Tree进行对比,最终发现Learned Index构建大批量数据的索引时间比B-Tree短,读操作性能、存储空间大小有明显的优势,但写操作性能较差,因此得出Learned Index更适用于大数据情景下的在线分析处理(OLAP)数据库,用于静态数据的存储和查询操作的结论。基于B-Tree的索引结构,对初版Learned Index的结构进行了优化和调整,最终使优化后Learned Index在大批量数据的读写操作性能上有明显提高,其中读操作最高达到原版Learned Index的2倍,写操作最高达到原版的3倍。

关键词: Learned Index, B-Tree, 可更新的自适应学习索引, 在线分析处理数据库, 静态数据, 优化调整

Abstract:

Learned Index is an index which establishes the relationship between input data and data storage location through training model. It can learn the information of data distribution, but different data distribution will affect the balance between model accuracy and model complexity. In order to explore the applicable scenarios of Learned Index, the performance of it and An Updatable Adaptive Learned Index (ALEX) were tested by using data of different distribution and amount, and compared with B-Tree. It is found that the time to build Learned Index of big data is shorter than that of B-Tree, and it has obvious advantage in reading operation performance and storage space size, but its writing operation performance is poor. Therefore, it is concluded that Learned Index is more suitable for OnLine Analytical Processing (OLAP) databases in big data scenarios, which is used for static data storage and query operation. Based on the structure of B-Tree, Learned Index was optimized and adjusted, and finally the performance of the optimized Learned Index was significantly improved in the read and write operations of large volume data, which beats B-Tree by up to 2 times on reading operation performance and 3 times on writing operation performance.

Key words: Learned Index, B-Tree, An Updatable Adaptive Learned Index (ALEX), OnLine Analytical Processing (OLAP) database, static data, optimization and adjustment

中图分类号: