《计算机应用》唯一官方网站 ›› 2020, Vol. 40 ›› Issue (2): 426-433.DOI: 10.11772/j.issn.1001-9081.2019081605
• 第36届CCF中国数据库学术会议(NDBC 2019) • 上一篇 下一篇
收稿日期:
2019-08-16
修回日期:
2019-09-23
接受日期:
2019-10-24
发布日期:
2019-10-31
出版日期:
2020-02-10
通讯作者:
周军锋
作者简介:
杜明(1975—),男,黑龙江虎林人,副教授,博士,主要研究方向:信息检索、数据分析和挖掘、自然语言处理
Ming DU1, Anping YANG1, Junfeng ZHOU1(), Ziyang CHEN2, Yun YANG1
Received:
2019-08-16
Revised:
2019-09-23
Accepted:
2019-10-24
Online:
2019-10-31
Published:
2020-02-10
Contact:
Junfeng ZHOU
About author:
DU Ming, born in 1975, Ph. D., associate professor. His research interests include information retrieval, data analysis and mining, natural language processing.摘要:
k步可达查询用于在给定的有向无环图(DAG)中回答两点之间是否存在长度不超过k的路径。针对现有方法的索引规模大、查询处理效率低的问题,提出一种基于部分点的双向最短路径索引来提升索引的可达信息覆盖率,并提出一组优化规则来减小索引规模;然后提出基于简化图的正反互逆拓扑索引来加速回答不可达查询;最后提出远距离优先的双向遍历策略来提高查询处理的效率。基于21个真实数据集(如引用网络、社交网络等)的实验结果表明,相比已有的高效方法PLL及BFSI-B,所提出的算法具有更小的索引规模和更快的查询响应速度。
中图分类号:
杜明, 杨安平, 周军锋, 陈子阳, 杨云. 有向无环图上k步可达查询优化算法[J]. 计算机应用, 2020, 40(2): 426-433.
Ming DU, Anping YANG, Junfeng ZHOU, Ziyang CHEN, Yun YANG. Optimized algorithm for k-step reachability queries on directed acyclic graphs[J]. Journal of Computer Applications, 2020, 40(2): 426-433.
顶点 | LIN | LOUT |
---|---|---|
a | ||
b | ||
c | ||
d | ||
e | ||
f | ||
g |
表1 基于所有顶点的最短路径索引
Tab. 1 Shortest path index based on all points
顶点 | LIN | LOUT |
---|---|---|
a | ||
b | ||
c | ||
d | ||
e | ||
f | ||
g |
顶点 | ||
---|---|---|
a | ||
b | ||
c | ||
d | ||
e | ||
f | ||
g |
表2 基于c和f的最短路径索引
Tab. 2 Shortest path index based on c and f
顶点 | ||
---|---|---|
a | ||
b | ||
c | ||
d | ||
e | ||
f | ||
g |
顶点 | ||
---|---|---|
a | ||
b | ||
c | ||
d | ||
e | ||
f | ||
g |
表3 基于c和f的优化后的最短路径索引
Tab. 3 Optimized shortest path index based on c and f
顶点 | ||
---|---|---|
a | ||
b | ||
c | ||
d | ||
e | ||
f | ||
g |
数据集 | |||
---|---|---|---|
human | 38 811 | 39 576 | 1.01 |
citeseer | 10 720 | 44 258 | 4.13 |
pubmed | 9 000 | 40 028 | 4.44 |
yago | 6 642 | 42 392 | 6.38 |
arxiv | 6 000 | 66 707 | 11.12 |
Email-EuAll | 231 000 | 223 004 | 0.97 |
uniprot100m | 16 087 295 | 16 087 293 | 1.00 |
uniprot150m | 25 037 600 | 25 037 598 | 1.00 |
WikiTalk | 2 281 879 | 2 311 570 | 1.01 |
soc-LJ | 971 232 | 1 024 140 | 1.05 |
web-Google | 371 764 | 517 805 | 1.39 |
10cit-Patent | 1 097 775 | 1 651 894 | 1.51 |
10citeseerx | 770 539 | 1 501 126 | 1.95 |
citeseerx | 6 540 401 | 15 011 260 | 2.30 |
dbpedia | 3 365 623 | 7 989 191 | 2.37 |
govwild | 8 022 880 | 23 652 610 | 2.95 |
cit-Patents | 3 774 768 | 16 518 947 | 4.38 |
go_uniprot | 6 967 956 | 34 769 339 | 4.99 |
10go-unipr | 469 526 | 3 476 397 | 7.40 |
18 121 168 | 18 359 487 | 1.01 | |
web-uk | 22 753 644 | 38 184 039 | 1.68 |
表4 实验数据集统计信息
Tab. 4 Statistics of experimental datasets
数据集 | |||
---|---|---|---|
human | 38 811 | 39 576 | 1.01 |
citeseer | 10 720 | 44 258 | 4.13 |
pubmed | 9 000 | 40 028 | 4.44 |
yago | 6 642 | 42 392 | 6.38 |
arxiv | 6 000 | 66 707 | 11.12 |
Email-EuAll | 231 000 | 223 004 | 0.97 |
uniprot100m | 16 087 295 | 16 087 293 | 1.00 |
uniprot150m | 25 037 600 | 25 037 598 | 1.00 |
WikiTalk | 2 281 879 | 2 311 570 | 1.01 |
soc-LJ | 971 232 | 1 024 140 | 1.05 |
web-Google | 371 764 | 517 805 | 1.39 |
10cit-Patent | 1 097 775 | 1 651 894 | 1.51 |
10citeseerx | 770 539 | 1 501 126 | 1.95 |
citeseerx | 6 540 401 | 15 011 260 | 2.30 |
dbpedia | 3 365 623 | 7 989 191 | 2.37 |
govwild | 8 022 880 | 23 652 610 | 2.95 |
cit-Patents | 3 774 768 | 16 518 947 | 4.38 |
go_uniprot | 6 967 956 | 34 769 339 | 4.99 |
10go-unipr | 469 526 | 3 476 397 | 7.40 |
18 121 168 | 18 359 487 | 1.01 | |
web-uk | 22 753 644 | 38 184 039 | 1.68 |
数据集 | PLL | BFSI-B | kRH |
---|---|---|---|
human | 0.91 | 1.18 | 0.39 |
citeseer | 0.77 | 0.33 | 0.12 |
pubmed | 0.70 | 0.27 | 0.11 |
yago | 0.51 | 0.20 | 0.08 |
arxiv | 2.68 | 0.18 | 0.07 |
Email-EuAll | 5.22 | 7.05 | 3.25 |
uniprot100m | 393.76 | 490.95 | 240.47 |
uniprot150m | 635.14 | 764.09 | 342.04 |
WikiTalk | 52.54 | 69.67 | 30.82 |
soc-LJ | 22.73 | 29.64 | 11.82 |
web-Google | 10.12 | 11.35 | 4.67 |
10cit-Patent | 30.22 | 33.50 | 13.75 |
10citeseerx | 39.28 | 23.52 | 8.76 |
citeseerx | 276.14 | 199.60 | 90.80 |
dbpedia | 121.90 | 102.71 | 46.36 |
govwild | 398.22 | 244.84 | 116.42 |
cit-Patents | 1 748.75 | 115.20 | 53.60 |
go_uniprot | 532.72 | 212.64 | 197.32 |
10go-unipr | 46.63 | 14.33 | 6.17 |
416.68 | 553.01 | 236.51 | |
web-uk | 967.76 | 694.39 | 307.19 |
表5 不同数据集上的索引大小 (MB)
Tab. 5 Index size on different datasets
数据集 | PLL | BFSI-B | kRH |
---|---|---|---|
human | 0.91 | 1.18 | 0.39 |
citeseer | 0.77 | 0.33 | 0.12 |
pubmed | 0.70 | 0.27 | 0.11 |
yago | 0.51 | 0.20 | 0.08 |
arxiv | 2.68 | 0.18 | 0.07 |
Email-EuAll | 5.22 | 7.05 | 3.25 |
uniprot100m | 393.76 | 490.95 | 240.47 |
uniprot150m | 635.14 | 764.09 | 342.04 |
WikiTalk | 52.54 | 69.67 | 30.82 |
soc-LJ | 22.73 | 29.64 | 11.82 |
web-Google | 10.12 | 11.35 | 4.67 |
10cit-Patent | 30.22 | 33.50 | 13.75 |
10citeseerx | 39.28 | 23.52 | 8.76 |
citeseerx | 276.14 | 199.60 | 90.80 |
dbpedia | 121.90 | 102.71 | 46.36 |
govwild | 398.22 | 244.84 | 116.42 |
cit-Patents | 1 748.75 | 115.20 | 53.60 |
go_uniprot | 532.72 | 212.64 | 197.32 |
10go-unipr | 46.63 | 14.33 | 6.17 |
416.68 | 553.01 | 236.51 | |
web-uk | 967.76 | 694.39 | 307.19 |
数据集 | PLL | BFSI-B | kRH |
---|---|---|---|
human | 10.60 | 5.89 | 5.24 |
citeseer | 22.24 | 3.09 | 2.96 |
pubmed | 24.35 | 2.99 | 3.20 |
yago | 7.45 | 2.61 | 2.55 |
arxiv | 232.31 | 2.58 | 11.45 |
Email-EuAll | 78.03 | 38.41 | 41.55 |
unprot100m | 6 642.67 | 4 299.18 | 3 817.71 |
unprot150m | 12 033.31 | 6 776.78 | 6 725.61 |
WikiTalk | 617.26 | 369.50 | 318.81 |
soc-LJ | 360.48 | 172.819 | 185.40 |
web-Google | 223.31 | 99.04 | 101.70 |
10cit-Patent | 964.10 | 614.03 | 420.48 |
10citeseerx | 1 690.31 | 314.29 | 254.37 |
citeseerx | 14 563.31 | 5 012.91 | 4 166.43 |
dbpedia | 3 989.27 | 1 777.21 | 1 835.07 |
govwild | 12 069.51 | 3 347.31 | 3 769.61 |
cit-Patents | 476 502 | 8 040.33 | 6 065.75 |
go_uniprot | 15 748.51 | 4 079.98 | 7 624.27 |
10go-unipr | 1 453.04 | 328.76 | 584.98 |
6 727.98 | 3 712.88 | 3 664.62 | |
web-uk | 47 093.21 | 3 703.31 | 4 686.35 |
表6 不同数据集上的索引时间 (ms)
Tab. 6 Index time on different datasets
数据集 | PLL | BFSI-B | kRH |
---|---|---|---|
human | 10.60 | 5.89 | 5.24 |
citeseer | 22.24 | 3.09 | 2.96 |
pubmed | 24.35 | 2.99 | 3.20 |
yago | 7.45 | 2.61 | 2.55 |
arxiv | 232.31 | 2.58 | 11.45 |
Email-EuAll | 78.03 | 38.41 | 41.55 |
unprot100m | 6 642.67 | 4 299.18 | 3 817.71 |
unprot150m | 12 033.31 | 6 776.78 | 6 725.61 |
WikiTalk | 617.26 | 369.50 | 318.81 |
soc-LJ | 360.48 | 172.819 | 185.40 |
web-Google | 223.31 | 99.04 | 101.70 |
10cit-Patent | 964.10 | 614.03 | 420.48 |
10citeseerx | 1 690.31 | 314.29 | 254.37 |
citeseerx | 14 563.31 | 5 012.91 | 4 166.43 |
dbpedia | 3 989.27 | 1 777.21 | 1 835.07 |
govwild | 12 069.51 | 3 347.31 | 3 769.61 |
cit-Patents | 476 502 | 8 040.33 | 6 065.75 |
go_uniprot | 15 748.51 | 4 079.98 | 7 624.27 |
10go-unipr | 1 453.04 | 328.76 | 584.98 |
6 727.98 | 3 712.88 | 3 664.62 | |
web-uk | 47 093.21 | 3 703.31 | 4 686.35 |
数据集 | PLL | BFSI-B | kRH |
---|---|---|---|
human | 54.04 | 42.36 | 22.49 |
citeseer | 94.97 | 84.81 | 46.68 |
pubmed | 63.44 | 61.99 | 38.49 |
yago | 51.48 | 35.67 | 17.67 |
arxiv | 371.15 | 336.50 | 145.22 |
Email-EuAll | 86.57 | 57.71 | 37.03 |
unprot100m | 71.09 | 60.83 | 11.49 |
unprot150m | 65.43 | 62.15 | 11.45 |
WikiTalk | 105.08 | 77.17 | 48.57 |
soc-LJ | 153.13 | 124.11 | 76.81 |
web-Google | 141.81 | 132.59 | 69.11 |
10cit-Patent | 88.39 | 39.58 | 15.16 |
10citeseerx | 143.41 | 74.84 | 54.21 |
citeseerx | 174.43 | 126.10 | 97.17 |
dbpedia | 149.32 | 131.11 | 72.29 |
govwild | 172.26 | 127.73 | 97.14 |
cit-Patents | 539.40 | 158.39 | 96.79 |
go_uniprot | 184.28 | 54.32 | 36.62 |
10go-unipr | 154.07 | 58.70 | 54.03 |
214.22 | 155.51 | 120.57 | |
web-uk | 248.95 | 242.83 | 207.51 |
表7 k=3时不同数据集上的查询时间 (ms)
Tab. 7 Query time on different datasets when k=3
数据集 | PLL | BFSI-B | kRH |
---|---|---|---|
human | 54.04 | 42.36 | 22.49 |
citeseer | 94.97 | 84.81 | 46.68 |
pubmed | 63.44 | 61.99 | 38.49 |
yago | 51.48 | 35.67 | 17.67 |
arxiv | 371.15 | 336.50 | 145.22 |
Email-EuAll | 86.57 | 57.71 | 37.03 |
unprot100m | 71.09 | 60.83 | 11.49 |
unprot150m | 65.43 | 62.15 | 11.45 |
WikiTalk | 105.08 | 77.17 | 48.57 |
soc-LJ | 153.13 | 124.11 | 76.81 |
web-Google | 141.81 | 132.59 | 69.11 |
10cit-Patent | 88.39 | 39.58 | 15.16 |
10citeseerx | 143.41 | 74.84 | 54.21 |
citeseerx | 174.43 | 126.10 | 97.17 |
dbpedia | 149.32 | 131.11 | 72.29 |
govwild | 172.26 | 127.73 | 97.14 |
cit-Patents | 539.40 | 158.39 | 96.79 |
go_uniprot | 184.28 | 54.32 | 36.62 |
10go-unipr | 154.07 | 58.70 | 54.03 |
214.22 | 155.51 | 120.57 | |
web-uk | 248.95 | 242.83 | 207.51 |
数据集 | BFSI-B | kRH |
---|---|---|
human | 799 564 | 999 956 |
citeseer | 794 836 | 911 257 |
pubmed | 760 636 | 926 193 |
yago | 711 253 | 984 771 |
arxiv | 603 701 | 771 523 |
Email-EuAll | 698 539 | 999 939 |
uniprot100m | 808 389 | 999 995 |
uniprot150m | 785 811 | 999 994 |
WikiTalk | 882 736 | 999 853 |
soc-LJ | 753 982 | 998 973 |
web-Google | 754 241 | 986 717 |
10cit-Patent | 847 771 | 998 752 |
10citeseerx | 710 964 | 977 381 |
citeseerx | 646 989 | 977 857 |
dbpedia | 888 168 | 986 482 |
govwild | 891 929 | 980 872 |
cit-Patents | 717 408 | 998 286 |
go_uniprot | 877 038 | 965 690 |
10go-unipr | 846 093 | 932 241 |
854 191 | 999 701 | |
web-uk | 745 658 | 972 058 |
表8 k=3时常量时间内可判定的查询数量
Tab. 8 Query answers can be determined in constant time when k=3
数据集 | BFSI-B | kRH |
---|---|---|
human | 799 564 | 999 956 |
citeseer | 794 836 | 911 257 |
pubmed | 760 636 | 926 193 |
yago | 711 253 | 984 771 |
arxiv | 603 701 | 771 523 |
Email-EuAll | 698 539 | 999 939 |
uniprot100m | 808 389 | 999 995 |
uniprot150m | 785 811 | 999 994 |
WikiTalk | 882 736 | 999 853 |
soc-LJ | 753 982 | 998 973 |
web-Google | 754 241 | 986 717 |
10cit-Patent | 847 771 | 998 752 |
10citeseerx | 710 964 | 977 381 |
citeseerx | 646 989 | 977 857 |
dbpedia | 888 168 | 986 482 |
govwild | 891 929 | 980 872 |
cit-Patents | 717 408 | 998 286 |
go_uniprot | 877 038 | 965 690 |
10go-unipr | 846 093 | 932 241 |
854 191 | 999 701 | |
web-uk | 745 658 | 972 058 |
1 | ZHANG T, GAO Y, LI C, et al. Distributed reachability queries on massive graphs[C]// Proceedings of the 2019 International Conference on Database Systems for Advanced Applications, LNCS11448. Cham: Springer, 2019: 406-410. |
2 | SENGUPTA N, BAGCHI A, RAMANATH M, et al. Arrow: approximating reachability using random walks over Web-scale graphs[C]// Proceedings of the IEEE 35th International Conference on Data Engineering. Piscataway: IEEE, 2019:470-481. 10.1109/icde.2019.00049 |
3 | XIE X, YANG X, WANG X, et al. BFSI-B: an improved k-hop graph reachability queries for cyber-physical systems[J]. Information Fusion, 2017, 38:35-42. 10.1016/j.inffus.2017.02.009 |
4 | YANO Y, AKIBA T, IWATA Y, et al. Fast and scalable reachability queries on graphs by pruned labeling with landmarks and paths[C]// Proceedings of the 22nd ACM International Conference on Information and Knowledge Management. New York: ACM, 2013: 1601-1606. 10.1145/2505515.2505724 |
5 | CHENG J, SHANG Z, CHENG H, et al. Efficient processing of k-hop reachability queries[J]. The VLDB Journal, 2014, 23(2):227-252. 10.1007/s00778-013-0346-6 |
6 | CHENG J, SHANG Z, CHENG H, et al. K-reach: who is in your small world[J]. Proceedings of the VLDB Endowment, 2012, 5(11): 1292-1303. 10.14778/2350229.2350247 |
7 | CHENG J, KE Y, CHU S, et al. Efficient processing of distance queries in large graphs: a vertex cover approach[C]// Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2012: 457-468. 10.1145/2213836.2213888 |
8 | AGRAWAL R, BORGIDA A, JAGADISH H V. Efficient management of transitive relationships in large data and knowledge bases[C]// Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data. New York: ACM, 1989:253-262. 10.1145/66926.66950 |
9 | AKIBA T, IWATA Y, YUICHI YOSHIDA Y. Fast exact shortest-path distance queries on large networks by pruned landmark labeling[C]// Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2013: 349-360. 10.1145/2463676.2465315 |
10 | TRIßL S, LESER U. Fast and practical indexing and querying of very large graphs[C]// Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2007: 845-856. 10.1145/1247480.1247573 |
11 | YILDIRIM H, CHAOJI V, ZAKI M J. Grail: scalable reachability index for large graphs[J]. Proceedings of the VLDB Endowment, 2010, 3(1/2): 276-284. 10.14778/1920841.1920879 |
12 | VELOSO R R, CERF L, MEIRA W, Jr, et al. Reachability queries in very large graphs: a fast refined online search approach[C]// Proceedings of the 17th International Conference on Extending DB Technology. Berling: Springer, 2014: 511-522. |
13 | ZHOU J, ZHOU S, YU J X, et al. DAG reduction: fast answering reachability queries[C]// Proceedings of the 2017 ACM International Conference on Management of Data. New York: ACM, 2017: 375-390. 10.1145/3035918.3035927 |
14 | BOLDI P, SANTINI M, VIGNA S. A large time-aware Web graph[J]. ACM SIGIR Forum, 2008, 42(2): 33-38. 10.1145/1480506.1480511 |
15 | JIN R, RUAN N, DEY S, et al. Scarab: scaling reachability computation on large graphs[C]// Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2012: 169-180. 10.1145/2213836.2213856 |
16 | van SCHAIK S, de MOOR O. A memory efficient reachability data structure through bit vector compression[C]// Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2011: 913-924. 10.1145/1989323.1989419 |
[1] | 邓辅秦, 黄焕钊, 谭朝恩, 付兰慧, 张建民, 林天麟. 结合遗传算法和滚动调度的多机器人任务分配算法[J]. 《计算机应用》唯一官方网站, 2023, 43(12): 3833-3839. |
[2] | 刘乾, 张洋铭, 万定生. 网格化分布式新安江模型并行计算算法[J]. 《计算机应用》唯一官方网站, 2023, 43(11): 3327-3333. |
[3] | 姜松岩, 廖晓鹃, 陈光柱. 基于可满足性模理论的多处理机通信延迟优化任务调度方法[J]. 《计算机应用》唯一官方网站, 2023, 43(1): 185-191. |
[4] | 王旭, 申玉民, 熊晓芸, 李鹏, 王金龙. 基于哈希图的建筑物联网数据管理方法[J]. 《计算机应用》唯一官方网站, 2022, 42(8): 2471-2480. |
[5] | 吴晴晴, 周丽华, 寸轩懿, 杜国王, 姜懿庭. 异质信息网络中基于有向无环图的影响力最大化算法[J]. 《计算机应用》唯一官方网站, 2022, 42(3): 895-903. |
[6] | 姜琨, 刘征, 朱磊, 李晓星. 基于有向无环图的倒排链等字长划分压缩算法[J]. 计算机应用, 2021, 41(3): 727-732. |
[7] | 丁梦苏, 陈世敏. 轻量级大数据运算系统Helius[J]. 计算机应用, 2017, 37(2): 305-310. |
[8] | 王冠, 王宇新, 陈鑫, 王飞, 郭禾. 基于直接后继节点完成时间的异构调度算法[J]. 计算机应用, 2017, 37(1): 12-17. |
[9] | 温菊屏, 胡小生, 林冬梅, 曾亚光. 基于参考节点嵌入的图可达性查询[J]. 计算机应用, 2016, 36(7): 1998-2005. |
[10] | 许冬冬, 袁凌云. 基于事件共享机制的物联网复杂事件处理方法[J]. 计算机应用, 2015, 35(2): 326-331. |
[11] | 王宇新, 曹仕杰, 郭禾, 陈征, 陈鑫. 兼顾费用与公平的带通信开销的多有向无环图调度[J]. 计算机应用, 2015, 35(11): 3017-3020. |
[12] | 沈舒 朱志宇 吴将. 可重构混合任务调度算法[J]. 计算机应用, 2014, 34(2): 387-390. |
[13] | 刘丹琦 于炯 英昌甜. 云计算环境下多有向无环图工作流的节能调度算法[J]. 计算机应用, 2013, 33(09): 2410-2415. |
[14] | 李兴芳 苑迎春 王克俭. 基于拓扑序列归约的Web服务组合QoS度量算法[J]. 计算机应用, 2012, 32(05): 1432-1435. |
[15] | 魏韡 向阳 陈千. 计算术语间语义相似度的混合方法[J]. 计算机应用, 2010, 30(06): 1668-1670. |
阅读次数 | ||||||
全文 |
|
|||||
摘要 |
|
|||||