《计算机应用》唯一官方网站 ›› 2020, Vol. 40 ›› Issue (2): 510-517.DOI: 10.11772/j.issn.1001-9081.2019091666

• 第七届CCF大数据学术会议 • 上一篇    下一篇

低冗余计算的可达性查询保持图压缩策略

赵丹枫1, 林俊辰1, 宋巍1, 王建1, 黄冬梅2()   

  1. 1.上海海洋大学 信息学院,上海 201306
    2.上海电力大学,上海 200090
  • 收稿日期:2019-08-30 修回日期:2019-09-29 接受日期:2019-10-09 发布日期:2019-10-14 出版日期:2020-02-10
  • 通讯作者: 黄冬梅
  • 作者简介:赵丹枫(1982—),女,黑龙江五常人,讲师,博士,CCF会员,主要研究方向:大数据存储、业务流程管理
    林俊辰(1994—),男,吉林白山人,硕士研究生,主要研究方向:图计算
    宋巍(1977—),女,山西大同人,教授,博士,主要研究方向:计算机视觉、机器学习、数据挖掘
    王建(1979—),女,湖北武汉人,讲师,博士,主要研究方向:海洋信息技术;
  • 基金资助:
    国家重点研发计划项目(2016YFC1401907)

Strategy with low redundant computation for reachability query preserving graph compression

Danfeng ZHAO1, Junchen LIN1, Wei SONG1, Jian WANG1, Dongmei HUANG2()   

  1. 1.College of Information,Shanghai Ocean University,Shanghai 201306,China
    2.Shanghai University of Electric Power,Shanghai 200090,China
  • Received:2019-08-30 Revised:2019-09-29 Accepted:2019-10-09 Online:2019-10-14 Published:2020-02-10
  • Contact: Dongmei HUANG
  • About author:ZHAO Danfeng, born in 1982. Ph. D., lecturer. Her research interests include big data storage, business process management.
    LIN Junchen, born in 1994, M. S. candidate. His research interests include graph computation.
    SONG Wei, born in 1977. Ph. D., professor. Her research interests include computer vision, machine learning, data mining.
    WANG Jian, born in 1979. Ph. D., lecturer. Her research interests include marine information technology.
  • Supported by:
    the National Key Research and Development Program of China(2016YFC1401907)

摘要:

针对可达性查询保持图压缩(QPGC)算法存在冗余计算的问题,提出了一种高性能压缩策略。在求解顶点的祖先后代集阶段,针对普通图数据,提出一种基于拓扑排序的求解算法TSB,首先将图数据顶点拓扑排序,然后沿拓扑序列顺序(逆序)求解顶点的祖先(后代)集,避免了求解顺序不明确导致的冗余计算;针对最长路径较短的图数据,提出一种基于图聚合运算的求解算法AGGB,可在确定次数的聚合运算内完成顶点的祖先和后代集的求解。在求解可达性等价类阶段,提出一种分段统计剪枝算法PSP,先对祖先后代集分段统计,再比较统计值以实现粗匹配,剪除了部分不必要的精细匹配。实验结果表明,与QPGC算法相比:在祖先后代集求解阶段,TSB和AGGB在不同数据集上的性能平均提升94.22%和90.00%;在求解可达性等价类阶段,PSP算法在大部分数据集上性能提升超过70%;随着数据集的增大,TSB和AGGB配合PSP算法,性能提升了近28倍。理论分析和模拟实验表明,该策略与QPGC算法相比冗余计算更少、压缩速度更快。

关键词: 可达性查询, 图压缩, 查询保持, 图数据, 拓扑排序, 聚合运算

Abstract:

Since some computation in reachability Query Preserving Graph Compression (QPGC) algorithm are redundant, a high-performance compression strategy was proposed. In the stage of solving the vertex sets of ancestors and descendants, an algorithm named TSB (Topological Sorting Based algorithm for solving ancestor and descendant sets) was proposed for common graph data. Firstly, the vertices of the graph data were topological sorted. Then, the vertex sets were solved in the order or backward order of the topological sequence, avoiding the redundant computation caused by the ambiguous solution order. And an algorithm based on graph aggregation operation was proposed for graph data with short longest path, namely AGGB (AGGregation Based algorithm for solving ancestor and descendant sets), so the vertex sets were able to be solved in a certain number of aggregation operations. In the stage of solving reachability equivalence class, a Piecewise Statistical Pruning (PSP) algorithm was proposed. Firstly, piecewise statistics of ancestors and descendants sets were obtained and then the statistics were compared to achieve the coarse matching, and some unnecessary fine matches were pruned off. Experimental results show that compared with QPGC algorithm: in the stage of solving the vertex sets of ancestors and descendants, TSB and AGGB algorithm have the performance averagely increased by 94.22% and 90.00% respectively on different datasets; and in the stage of solving the reachability equivalence class, PSP algorithm has the performance increased by more than 70% on most datasets. With the increasing of the dataset, using TSB and AGGB cooperated with PSP has the performance improved by nearly 28 times. Theoretical analysis and simulation results show that the proposed strategy has less redundant computation and faster compression speed compared to QPGC.

Key words: reachability query, graph compression, query preserving, graph data, topological sorting, aggregation operation

中图分类号: