计算机应用 ›› 2016, Vol. 36 ›› Issue (3): 665-669.DOI: 10.11772/j.issn.1001-9081.2016.03.665

• 大数据 • 上一篇    下一篇

基于列表的可扩展标记语言流数据查询处理方法

何志学1,2, 廖湖声1   

  1. 1. 北京工业大学 计算机学院, 北京 100124;
    2. 北华航天工业学院 计算机与遥感信息技术学院, 河北 廊坊 065000
  • 收稿日期:2015-07-31 修回日期:2015-09-24 出版日期:2016-03-10 发布日期:2016-03-17
  • 通讯作者: 何志学
  • 作者简介:何志学(1982-),男,河北衡水人,讲师,博士研究生,主要研究方向:数据库、分布式计算;廖湖声(1954-),男,广东大埔人,教授,硕士,主要研究方向:数据库理论、软件自动化方法。
  • 基金资助:
    北京市自然科学基金资助项目(4122011);国家自然科学基金青年基金资助项目(61202074);河北省教育厅青年基金资助项目(QN2014178);北华航天工业学院校级科技创新团队资助项目(XJTD20140)。

Query processing method of XML streaming data using list

HE Zhixue1,2, LIAO Husheng1   

  1. 1. College of Computer Science, Beijing University of Technology, Beijing 100124, China;
    2. Computer and Remote Sensing Information Technology Institute, North China Institute of Aerospace Engineering, Langfang Hebei 065000, China
  • Received:2015-07-31 Revised:2015-09-24 Online:2016-03-10 Published:2016-03-17
  • Supported by:
    This work is partially supported by the Beijing Natural Science Foundation (4122011), the National Natural Science Foundation of China (61202074), the Educational Commission of Hebei Province(QN2014178) and the Science and Technology Planning Project of North China Institute of Aerospace Engineering (XJTD20140).

摘要: 针对半结构化可扩展标记语言(XML)流数据实时在线到达,顺序性一次访问及处理时效性高、缓存量小的需求,以及目前算法在大规模数据处理中查询表达式的能力有限、效率尚不能满足实际应用的现状,基于SAX解析,提出以列表及关系指针组合处理XPath查询的QXSList方法。首先定义数据模型,给出算法实现的整体框架,然后分别针对两个不同的XPath查询片段重点考虑了谓词判断条件和通配符的处理方法;该方法通过层次值计算判断节点的结构关系,利用关系指针链接多个候选节点列表,获取查询查询结果;最后分析给出优化算法,进一步减少缓存管理。通过实验对该方法与QStream++方法及MonetDB和SAXON查询引擎的运行时间和内存占比进行分析,得出与同类算法相比,随着数据量级的增加,效率提升在30%以上,且运行过程中内存占比接近于常量。

关键词: 流数据, 查询处理, XPath查询, 关系指针, 缓存管理

Abstract: Focused on the characteristics of processing semi-structure eXtensible Markup Language (XML) streaming data such as the stream real-time arriving continuously, requiring to be read sequentially and only once into memory, the query must be processed on the fly and usable buffer size is very little, and concerned the current status for limitation of query expression and inefficiency in practical applications of processing large scale data, QXList method was proposed for massive data processing based on SAX parsing XML. Data model and algorithm integrated framework were defined firstly. The integrated methods to process predicate and wildcard were discussed in detail. Layer value was used to determine the relationship of two elements and relational pointer was constructed to link multiple candidate nodes' lists to get query results in this method. Two optimal points were analyzed for decreasing buffer size. The experimental results show that the proposed approach is effective and efficient to this problem, and outperforms the state-of-the-art algorithms about 30 percent such as QStream++ and query engines MonetDB and SAXSON especially for large processed data. At the same time, memory usage is nearly constant.

Key words: streaming data, query processing, XPath query, relational pointer, buffer management

中图分类号: