Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (3): 860-865. DOI: 10.11772/j.issn.1001-9081.2017.03.860

• Data Science and Technology •

A new spatio-temporal indexing method based on real-time data and historical query distribution

孟学潮1, 叶少珍1,2   

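  1. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108;
  2. Fujian Key Laboratory of Medical Instrumentation and Pharmaceutical Technology, Fuzhou 350002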
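  • Received: 2016-08-15 Revised: 2016-10-03 Online: 2017-03-10 Published: 2017-03-22
  • Corresponding author: YE Shaozhen
  • About the authors: MENG Xuechao (born 1989), male, from Zhumadian, Henan, master's student; main research interests: big data storage and processing, and index optimization for spatio-temporal databases. YE Shaozhen (born 1963), female, from Fuzhou, Fujian, professor, Ph.D., CCF senior member; main research interests: intelligent analysis and processing of medical information, and e-commerce.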
  • Funding:
    National Natural Science Foundation of China (61502106); Regional Major Science and Technology Special Project of Fujian Province (2014H4015).

New spatio-temporal index method based on real-time data and query log distribution

MENG Xuechao1, YE Shaozhen1,2   

  1. College of Mathematics and Computer Science, Fuzhou University, Fuzhou Fujian 350108, China;
  2. Fujian Key Laboratory of Medical Instrumentation and Pharmaceutical Technology, Fuzhou Fujian 350002, China
  • Received:2016-08-15 Revised:2016-10-03 Online:2017-03-10 Published:2017-03-22
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61502106) and the Regional Special Major Science and Technology Project of Fujian Province (2014H4015).

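Abstract (Chinese version): In the big data era, data are large in volume, show pronounced spatio-temporal complexity, and carry strict real-time requirements, while traditional tree-based methods for indexing large-scale spatio-temporal data waste storage space and deliver low query efficiency. To address this problem, a new method named HDL-index is proposed, which builds a spatio-temporal index from the distribution of both the data and the historical query records. On the one hand, the algorithm builds index grids by spatial partitioning according to the spatial distribution of the data. On the other hand, considering the continuity of queries over time, it applies density-based clustering to the query record objects to abstract representative query models, and then partitions the overall query region according to each model's coordinates and query granularity. The index grids obtained from the two parts are both encoded with Geohash and are finally merged to obtain the optimal index encoding. Because HDL-index considers user query behavior as well as data distribution, the index is finer over frequently queried regions. Comparative tests against similar methods on a real aviation dataset show that index creation efficiency is improved by 50%, and that query efficiency over hotspot regions can be improved by more than 75% when the data are uniformly distributed.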

Key words (Chinese version): spatio-temporal index, big data, GeoHash encoding, density clustering, hotspot region query

Abstract: In the era of big data, data are large in volume, spatio-temporally complex, and subject to strict real-time requirements. However, traditional tree-based methods for indexing large-scale spatio-temporal data suffer from wasted storage space and low query efficiency. To solve this problem, a new method named HDL-index was proposed, which builds the spatio-temporal index from the distribution of both the data and the historical query records. On the one hand, the whole area was partitioned into index grids according to the spatial distribution of the data. On the other hand, taking into account the continuity of queries over time, representative query models were obtained by density-based clustering of the historical query objects, and the overall query area was then segmented according to the coordinates and query granularity of each model. The index grids from both parts were encoded with GeoHash and merged to obtain the optimal index coding. HDL-index takes account of users' query behavior as well as the data distribution, making the index finer over frequently queried areas. In comparative tests against similar methods on a real aviation dataset, the efficiency of index creation was improved by 50%, and the query efficiency over hotspot regions was improved by more than 75% when the data were evenly distributed.

Key words: spatio-temporal index, big data, GeoHash encoding, density clustering, hotspot region query
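
The grid-encoding idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' HDL-index implementation: `geohash_encode` follows the standard public Geohash scheme, while `adaptive_geohash`, the `hotspots` list of (lat, lon, radius) tuples, and the `coarse`/`fine` precisions are hypothetical stand-ins for the cluster-derived query models and their query granularity.

```python
import math

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard Geohash alphabet


def geohash_encode(lat, lon, precision):
    """Standard Geohash: interleave longitude/latitude bisection bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # the first bit of a Geohash refines longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def adaptive_geohash(lat, lon, hotspots, coarse=4, fine=6):
    """Return a longer (finer) code when the point lies inside a hotspot.

    Assumption: each hotspot is a (lat, lon, radius_in_degrees) tuple that
    stands in for a query model obtained by clustering historical queries.
    """
    for h_lat, h_lon, radius in hotspots:
        if math.hypot(lat - h_lat, lon - h_lon) <= radius:
            return geohash_encode(lat, lon, fine)
    return geohash_encode(lat, lon, coarse)


hotspots = [(57.6, 10.4, 0.5)]  # hypothetical frequently queried region
code_hot = adaptive_geohash(57.64911, 10.40744, hotspots)  # 6-character cell
code_cold = adaptive_geohash(0.0, 0.0, hotspots)           # 4-character cell
```

Because a Geohash prefix identifies the enclosing coarser cell, fine cells inside a hotspot remain compatible with the coarse grid elsewhere, which is one plausible reason a merge of the two grids can be carried out on the codes themselves.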

CLC number: