New spatio-temporal index method based on real-time data and query log distribution

doi:10.11772/j.issn.1001-9081.2017.03.860

Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (3): 860-865.

MENG Xuechao¹, YE Shaozhen¹,²

1. College of Mathematics and Computer Science, Fuzhou University, Fuzhou Fujian 350108, China;
2. Fujian Key Laboratory of Medical Instrumentation and Pharmaceutical Technology, Fuzhou Fujian 350002, China

Received: 2016-08-15 Revised: 2016-10-03 Online: 2017-03-22 Published: 2017-03-10
Corresponding author: YE Shaozhen
About the authors: MENG Xuechao (1989-), male, born in Zhumadian, Henan, M.S. candidate; research interests: big data storage and processing, index optimization for spatio-temporal databases. YE Shaozhen (1963-), female, born in Fuzhou, Fujian, professor, Ph.D., CCF senior member; research interests: intelligent analysis and processing of medical information, e-commerce.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61502106) and the Regional Special Major Science and Technology Project of Fujian Province (2014H4015).

Abstract: In the era of big data, data is large in volume, spatio-temporally complex, and subject to strict real-time requirements. Traditional tree-based methods for indexing large-scale spatio-temporal data waste storage space and deliver low query efficiency. To address this problem, a new method named HDL-index was proposed, which builds the spatio-temporal index from the distribution of both the data and the historical query records. On the one hand, index grids were built by partitioning space according to the spatial distribution of the data. On the other hand, considering the temporal continuity of queries, historical query records were grouped by density-based clustering to abstract representative query models, and the overall query area was then partitioned according to each model's coordinates and query granularity. The index grids from both parts were encoded with GeoHash and finally merged to obtain the optimal index coding. By accounting for user query behavior as well as data distribution, HDL-index refines the index over frequently queried areas. Comparison tests against similar methods on a real aviation data set show that index creation efficiency is improved by 50%, and that, when the data is evenly distributed, query efficiency over hotspot regions is improved by more than 75%.
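The idea of merging a data-driven grid with a query-driven grid through GeoHash codes can be illustrated with a small sketch. It is not the authors' HDL-index implementation: the function names `geohash_encode` and `build_index`, the `coarse` parameter, and the representation of a query model as a `(lat, lon, precision)` tuple are all assumptions made for illustration, and the density-clustering step that would produce the query models is taken as already done. The sketch relies only on the standard GeoHash property that a shorter code is a prefix of every longer code inside the same cell.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard Geohash alphabet


def geohash_encode(lat, lon, precision):
    """Encode a point as a Geohash string of `precision` characters.

    Bits alternate between longitude and latitude bisections (longitude
    first); every 5 bits are emitted as one base-32 character.
    """
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars, bits, bit_count = [], 0, 0
    use_lon = True
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def build_index(points, query_models, coarse=3):
    """Group points into Geohash cells, refining cells that are queried often.

    points:       iterable of (lat, lon) data records.
    query_models: iterable of (lat, lon, precision) tuples; a hypothetical
                  stand-in for the representative query models, with the
                  query granularity expressed as a Geohash precision.
    coarse:       precision of the data-driven base grid (an assumption).
    """
    # Record the finest precision requested inside each coarse cell.
    refine = {}
    for lat, lon, precision in query_models:
        cell = geohash_encode(lat, lon, coarse)
        refine[cell] = max(refine.get(cell, coarse), precision)

    # Merge: points in hot cells get the longer (finer) code as their key.
    index = defaultdict(list)
    for lat, lon in points:
        cell = geohash_encode(lat, lon, coarse)
        depth = refine.get(cell, coarse)
        index[geohash_encode(lat, lon, depth)].append((lat, lon))
    return dict(index)


# Hypothetical data: two records inside a frequently queried cell, one far away.
records = [(57.64911, 10.40744), (57.0, 10.0), (0.0, 0.0)]
models = [(57.64911, 10.40744, 5)]
for key, members in sorted(build_index(records, models).items()):
    print(key, members)
```

In this sketch, records inside a hot coarse cell are keyed by longer codes, so a query over that area touches smaller buckets, while sparsely queried areas keep short codes and few index entries. How HDL-index actually chooses cell sizes and merges the two code sets is described in the full paper, not in the abstract.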

Key words: spatio-temporal index, big data, GeoHash encoding, density clustering, hotspot region query

CLC Number: TP311

MENG Xuechao, YE Shaozhen. New spatio-temporal index method based on real-time data and query log distribution[J]. Journal of Computer Applications, 2017, 37(3): 860-865.
