Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (2): 440-448. DOI: 10.11772/j.issn.1001-9081.2021020255

• Data Science and Technology •

Query performance evaluation of distributed resource description framework data management systems

Jun FENG, Bingfa WANG, Jiamin LU

  1. College of Computer and Information, Hohai University, Nanjing, Jiangsu 211100, China
  • Received: 2021-02-22; Revised: 2021-04-28; Accepted: 2021-04-29; Online: 2021-05-11; Published: 2022-02-10
  • Corresponding author: Jiamin LU
  • About the authors: FENG Jun, born in 1969 in Wujin, Jiangsu, Ph.D., professor, CCF member. Her research interests include spatio-temporal data management, intelligent data processing, data mining, and water conservancy informatization.
    WANG Bingfa, born in 1995 in Jiujiang, Jiangxi, M.S. candidate, CCF member. His research interests include knowledge graph storage management and distributed SPARQL query optimization.
    LU Jiamin, born in 1983 in Nantong, Jiangsu, Ph.D., lecturer, CCF member. His research interests include distributed data processing, knowledge graphs, and water conservancy informatization.
  • Supported by:
    National Key Research and Development Program of China (2018YFC0407901)

Abstract:

As knowledge graph technology has developed, knowledge-graph-driven information management has been widely applied across many domains, which makes the efficiency of distributed SPARQL (SPARQL Protocol and RDF Query Language) queries over knowledge graphs particularly important. Firstly, existing Spark-based and main-memory (Random Access Memory, RAM)-based distributed Resource Description Framework (RDF) systems were surveyed in detail. Secondly, eight representative systems were selected from them for query performance evaluation, comparing how Spark-based and RAM-based systems perform across different query types, query diameters and datasets. Thirdly, the experimental results were analyzed comprehensively to assess the query performance of the Spark-based and RAM-based systems. Finally, in view of the problems that existing systems show in distributed SPARQL querying, such as poor query scalability, high query join complexity and long query compilation time, future research directions for distributed SPARQL query optimization oriented toward vertical application domains were outlined.

Key words: distributed Resource Description Framework (RDF), Random Access Memory (RAM), Spark, distributed SPARQL Protocol and RDF Query Language (SPARQL) query, selectivity, query efficiency, query accuracy

CLC Number: