Journal of Computer Applications, 2018, Vol. 38, Issue (6): 1584-1590. DOI: 10.11772/j.issn.1001-9081.2017112777

• Data Science and Technology •

基于内存的HBase二级索引设计

崔晨1, 郑林江1, 韩凤萍2, 何牧君1   

  1. 重庆大学 计算机学院, 重庆 400044;
    2. 重庆城市综合交通枢纽开发投资有限公司, 重庆 401121
  • 收稿日期:2017-11-27 修回日期:2018-02-04 出版日期:2018-06-10 发布日期:2018-06-13
  • 通讯作者: 崔晨
  • 作者简介:崔晨(1994-),男,安徽安庆人,硕士研究生,主要研究方向:智能交通系统、大数据;郑林江(1983-),男,四川邻水人,副教授,博士,CCF会员,主要研究方向:智能交通系统、大数据;韩凤萍(1983-),女,江苏扬州人,工程师,硕士,主要研究方向:交通工程;何牧君(1982-),男,浙江慈溪人,博士研究生,主要研究方向:智能交通系统、大数据。
  • 基金资助:
    国家863计划项目(2015AA015308);国家重点研发计划项目(2016YFC0801707);重庆市应用开发计划重点项目(cstc2014yykfB30003)。

Design of secondary indexes in HBase based on memory

CUI Chen1, ZHENG Linjiang1, HAN Fengping2, HE Mujun1   

  1. College of Computer Science, Chongqing University, Chongqing 400044, China;
    2. Chongqing Integrated Transport Hub Development Investment Company Limited, Chongqing 401121, China
  • Received:2017-11-27 Revised:2018-02-04 Online:2018-06-10 Published:2018-06-13
  • Corresponding author: CUI Chen
  • About the authors: CUI Chen, born in 1994 in Anqing, Anhui, M.S. candidate; research interests: intelligent transportation systems and big data. ZHENG Linjiang, born in 1983 in Linshui, Sichuan, Ph.D., associate professor, CCF member; research interests: intelligent transportation systems and big data. HAN Fengping, born in 1983 in Yangzhou, Jiangsu, M.S., engineer; research interests: traffic engineering. HE Mujun, born in 1982 in Cixi, Zhejiang, Ph.D. candidate; research interests: intelligent transportation systems and big data.
  • Supported by:
    This work is partially supported by the National High Technology R&D Program (863 Program) of China (2015AA015308), the National Key R&D Program of China (2016YFC0801707), and the Key Project of the Chongqing Application Development Plan (cstc2014yykfB30003).

摘要: 在大数据时代,具有海量数据存储能力的HBase已被广泛应用。HBase只对行键进行了索引优化,对非行键的列未建立索引,这严重影响了复杂条件查询的效率。针对此问题,提出了基于内存的HBase二级索引方案。该方案对需要查询的列建立了映射到行键的索引,并将索引存储在Spark搭建的内存环境中,在查询时先通过索引获取行键,然后利用行键在HBase中快速查找对应的记录。由于列的基数大小和是否涉及范围查询决定了建立索引的类型,故针对三种不同情况构建了不同类型的索引,并利用Spark内存计算、并行化的特点来提高索引的查询效率。实验结果表明,该二级索引具有较好的查询性能,查询时间小于基于Solr的二级索引,可以解决HBase中因非行键的列缺乏索引导致查询效率较低的问题,提高基于HBase存储的大数据分析的查询效率。

关键词: HBase, Spark, 二级索引, 内存索引, 并行化

Abstract: In the era of big data, HBase is widely used for its ability to store massive amounts of data. However, HBase optimizes indexing only for the rowkey and builds no indexes on non-rowkey columns, which seriously degrades the efficiency of queries with complex conditions. To address this problem, a memory-based secondary index scheme for HBase was proposed. For each column that needs to be queried, an index mapping column values to rowkeys was built and stored in an in-memory environment built on Spark. At query time, the rowkeys were first obtained from the index and then used to quickly look up the corresponding records in HBase. Because the cardinality of a column and whether range queries are involved determine the type of index to build, different types of indexes were constructed for three different cases, and Spark's in-memory computing and parallelism were exploited to improve the query efficiency of the indexes. The experimental results show that the proposed secondary index achieves good query performance, with query times shorter than those of a Solr-based secondary index. It can solve the problem of low query efficiency caused by the lack of indexes on non-rowkey columns in HBase and improve query efficiency for big data analysis based on HBase storage.

Key words: HBase, Spark, secondary index, memory index, parallelization
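The two-step lookup described in the abstract (index first, rowkey fetch second) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not name the three index types, so a hash map (equality queries) and a sorted list (range queries) are used here as assumed stand-ins, a plain dict plays the role of the HBase table, and no Spark or HBase API is involved. The table contents, column names, and function names are all hypothetical.

```python
import bisect
from collections import defaultdict

# Hypothetical stand-in for an HBase table: rowkey -> record.
TABLE = {
    "r1": {"plate": "A100", "speed": 42},
    "r2": {"plate": "B200", "speed": 67},
    "r3": {"plate": "A100", "speed": 55},
    "r4": {"plate": "C300", "speed": 67},
}


def build_hash_index(table, column):
    """Map each column value to the rowkeys holding it (equality queries)."""
    index = defaultdict(list)
    for rowkey, record in table.items():
        index[record[column]].append(rowkey)
    return index


def build_sorted_index(table, column):
    """Sorted (value, rowkey) pairs, so a range query is two binary searches."""
    return sorted((record[column], rowkey) for rowkey, record in table.items())


def range_rowkeys(sorted_index, low, high):
    """Rowkeys whose column value lies in the closed interval [low, high]."""
    values = [value for value, _ in sorted_index]
    start = bisect.bisect_left(values, low)
    end = bisect.bisect_right(values, high)
    return [rowkey for _, rowkey in sorted_index[start:end]]


def fetch(table, rowkeys):
    """Second step: fetch records by rowkey (a point lookup in a real store)."""
    return [table[rowkey] for rowkey in rowkeys]


# Assumed example columns: "plate" for equality, "speed" for range queries.
plate_index = build_hash_index(TABLE, "plate")
speed_index = build_sorted_index(TABLE, "speed")

same_plate = fetch(TABLE, plate_index["A100"])   # index lookup, then fetch
fast_rows = range_rowkeys(speed_index, 50, 70)   # rowkeys with 50 <= speed <= 70
```

In both cases the query never scans the table itself: the index yields rowkeys, and only those rows are read. Keeping such structures in memory and partitioning them across workers is the part the paper delegates to Spark, which this sketch does not attempt to model.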

CLC number: