Probery: probability-based data query system for big data

doi:10.11772/j.issn.1001-9081.2016.01.0008

Journal of Computer Applications, 2016, Vol. 36, Issue (1): 8-12. DOI: 10.11772/j.issn.1001-9081.2016.01.0008

The 32nd National Database Conference of China (NDBC 2015)

WU Jinbo¹, SONG Jie¹, ZHANG Li¹, BAO Yubin²

1. Software College, Northeastern University, Shenyang, Liaoning 110819, China;
2. College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning 110819, China

Received: 2015-08-26 Revised: 2015-09-17 Online: 2016-01-09 Published: 2016-01-10
Corresponding author: SONG Jie (born 1980), male, from Huaibei, Anhui; associate professor, Ph.D., CCF senior member. Research interests: big data management, cloud computing, energy-efficient computing.
About the authors: WU Jinbo (born 1992), male, from Weinan, Shaanxi; master's student. Research interests: big data management, cloud computing. ZHANG Li (born 1978), female, from Shenyang, Liaoning; lecturer, Ph.D. Research interests: big data management, services computing. BAO Yubin (born 1968), male, from Ji'an, Jilin; professor, Ph.D., CCF senior member. Research interests: data warehousing, online analytical processing, cloud computing, data-intensive computing.
Supported by:
This work is partially supported by the Major Program of the National Natural Science Foundation of China (61433008), the National Science Foundation for Young Scientists of China (61202088), the China Postdoctoral Science Foundation General Program (2013M540232), the Fundamental Research Funds for the Central Universities (N120817001), and the Research Fund for the Doctoral Program of Higher Education of China (20120042110028).

Abstract

Abstract: Full-result queries over big data are excessively time-consuming. To address this problem, the Probery system is proposed, which adopts an approximate full-result query method. Unlike traditional approximate queries, the approximation in Probery lies in the probability that the query result contains all data satisfying the query conditions, which makes it a new kind of data query method. First, Probery divides the data stored in the system into multiple data segments. Second, it places the data of each segment in a Distributed File System (DFS) according to a probability placement model. Finally, for a given query condition, it performs a probabilistic query using a heuristic query method. Probery was evaluated by comparing its query performance with that of other mainstream non-relational data management systems. At the cost of an 8% loss in query completeness, Probery reduced query time by 51% compared with HBase, by 23% compared with Cassandra, by 12% compared with MongoDB, and by 3% compared with Hive. The experimental results show that Probery can trade a moderate loss of query completeness for better query performance, and that it has good generality, adaptability and extensibility for big data query.

Key words: big data, probability query, recall probability, Distributed File System (DFS), MapReduce
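
The abstract names three steps (segmenting, probability-based placement, heuristic probabilistic query) without giving their details. The sketch below is only a toy illustration of that trade-off and is not the paper's model: the placement distribution `PLACEMENT`, the functions `place`, `plan_query` and `query`, and the greedy "scan the most likely partitions first" rule are all assumptions made for this example. In this simplification, `covered` is the chance that any single matching record lies in a scanned partition, which is weaker than the recall probability defined in the abstract.

```python
import random
from collections import defaultdict

# Hypothetical placement distribution (an assumption, not from the paper):
# probability that a record of one data segment is written to each partition.
PLACEMENT = [0.6, 0.25, 0.1, 0.05]


def place(records, probs, rng):
    """Store each record in one partition drawn at random from `probs`."""
    partitions = defaultdict(list)
    for rec in records:
        idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        partitions[idx].append(rec)
    return partitions


def plan_query(probs, target):
    """Greedily pick the most likely partitions until their placement
    probabilities sum to at least `target`."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, covered = [], 0.0
    for idx in order:
        if covered >= target:
            break
        chosen.append(idx)
        covered += probs[idx]
    return chosen, covered


def query(partitions, chosen):
    """Scan only the chosen partitions and return the records found there."""
    return [rec for idx in chosen for rec in partitions.get(idx, [])]


if __name__ == "__main__":
    rng = random.Random(0)
    parts = place(range(1000), PLACEMENT, rng)
    chosen, covered = plan_query(PLACEMENT, target=0.92)
    found = query(parts, chosen)
    print(f"scanned partitions {chosen}, coverage {covered:.2f}, "
          f"found {len(found)} of 1000 records")
```

With a target of 0.92, this plan scans three of the four partitions and skips the least likely one, so the query reads less data at the cost of possibly missing the records stored there.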

CLC number: TP311

WU Jinbo, SONG Jie, ZHANG Li, BAO Yubin. Probery: probability-based data query system for big data[J]. Journal of Computer Applications, 2016, 36(1): 8-12.

