Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (11): 3184-3191. DOI: 10.11772/j.issn.1001-9081.2020040539

• Data science and technology •

Survey of large-scale resource description framework data partitioning methods in distributed environment

YANG Cheng, LU Jiamin, FENG Jun   

  1. College of Computer and Information, Hohai University, Nanjing Jiangsu 211100, China
  • Received: 2020-04-26 Revised: 2020-06-21 Online: 2020-11-10 Published: 2020-07-20
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2017YFC0405806, 2018YFC0407901).


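  • Corresponding author: FENG Jun (born 1969), female, from Wujin, Jiangsu; professor, Ph.D., CCF member. Main research interests: spatio-temporal data management, intelligent data processing, data mining, and water conservancy informatization.
  • About the authors: YANG Cheng (born 1996), female, from Wuhu, Anhui; M.S. candidate, CCF member. Main research interests: knowledge graph data management and distributed databases. LU Jiamin (born 1983), male, from Nantong, Jiangsu; lecturer, Ph.D., CCF member. Main research interests: moving object data management, distributed data processing, and water conservancy informatization.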

Abstract: With the rapid development of knowledge graphs and their wide use in various vertical domains, the efficient processing of Resource Description Framework (RDF) data has become a new topic in modern big data management. RDF is a data model proposed by the W3C to describe knowledge graph entities and the relationships between them. To cope with the storage and querying of large-scale RDF data, many researchers manage RDF data in a distributed environment. The key problem in the distributed storage of RDF data is data partitioning, and the performance of SPARQL Protocol and RDF Query Language (SPARQL) queries is largely determined by the partitioning result. From the perspective of data partitioning, this survey focuses on two types of methods and describes them in depth: graph structure-based RDF data partitioning methods and semantics-based RDF data partitioning methods. The former include multi-granularity hierarchical partitioning, template partitioning and clustering partitioning, and suit general-domain queries, whose semantic categories are broad; the latter include hash partitioning, vertical partitioning and pattern partitioning, and are better suited to vertical-domain queries, whose semantic categories are relatively fixed. In addition, several typical partitioning methods are compared and analyzed to provide a reference for future research on RDF data partitioning methods. Finally, future research directions for RDF data partitioning methods are summarized.

Key words: Resource Description Framework (RDF), data partitioning, distributed RDF data storage, SPARQL Protocol and RDF Query Language (SPARQL) query, distributed database
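
To make two of the partitioning families named in the abstract concrete, the following is a minimal Python sketch of hash partitioning and vertical partitioning over a handful of RDF triples. It assumes common textbook formulations (hashing on the subject, and one table per predicate), which are not necessarily the exact variants the survey covers. The sample triples, the `ex:` prefix, and the function names `hash_partition` and `vertical_partition` are hypothetical and chosen only for illustration.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical sample data, not taken from the survey.
TRIPLES: List[Triple] = [
    ("ex:Alice", "ex:worksAt", "ex:Hohai"),
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Bob", "ex:worksAt", "ex:Hohai"),
    ("ex:Hohai", "ex:locatedIn", "ex:Nanjing"),
]


def hash_partition(triples: List[Triple], num_partitions: int) -> List[List[Triple]]:
    """Assign each triple to a partition by hashing its subject.

    Assumption: subject-keyed hashing, so all triples that share a
    subject land on the same node. crc32 is used because Python's
    built-in hash() for strings varies between processes.
    """
    partitions: List[List[Triple]] = [[] for _ in range(num_partitions)]
    for s, p, o in triples:
        index = zlib.crc32(s.encode("utf-8")) % num_partitions
        partitions[index].append((s, p, o))
    return partitions


def vertical_partition(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Group triples into one two-column (subject, object) table per predicate."""
    tables: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


if __name__ == "__main__":
    for i, part in enumerate(hash_partition(TRIPLES, 3)):
        print("partition", i, part)
    for predicate, rows in vertical_partition(TRIPLES).items():
        print(predicate, rows)
```

Under these assumptions, subject hashing keeps star-shaped queries around one subject on a single node, while vertical partitioning lets a query that names a predicate read only that predicate's table.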

