Knowledge-guided visual relationship detection model

doi:10.11772/j.issn.1001-9081.2023040413

Journal of Computer Applications, 2024, Vol. 44, Issue (3): 683-689. DOI: 10.11772/j.issn.1001-9081.2023040413

Topic: Artificial intelligence

Yuanlong WANG, Wenbo HU, Hu ZHANG

School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi 030006, China

Received: 2023-04-13 Revised: 2023-07-04 Accepted: 2023-07-10 Online: 2023-12-04 Published: 2024-03-10
Contact: Yuanlong WANG
About the authors: HU Wenbo, born in 1998 in Yuncheng, Shanxi, M.S. candidate. His research interests include natural language processing and computer vision.
ZHANG Hu, born in 1979 in Datong, Shanxi, Ph.D., professor, CCF member. His research interests include natural language processing.
Supported by:
National Natural Science Foundation of China (62176145)

Abstract:

Visual Relationship Detection (VRD) builds on object recognition to detect the relationships between the objects in an image, and is a key technique for visual understanding and reasoning. Because detected objects are combined with one another to form candidate pairs, the number of candidate relationships grows combinatorially (combinatorial explosion). Many of the resulting entity pairs are only weakly correlated, which lowers the recall of the subsequent relationship detection. To address this problem, a knowledge-guided visual relationship detection model is proposed. First, visual knowledge is constructed: the entity labels and relationship labels in common visual relationship detection datasets are analyzed statistically, and the co-occurrence frequencies between entities and relationships are taken as visual knowledge. Then, this knowledge is used to optimize the pairing of entities: the scores of weakly correlated entity pairs are lowered and the scores of strongly correlated pairs are raised, the pairs are ranked by score, and the low-scoring pairs are deleted. The relationship scores between entities are optimized in the same knowledge-guided way, which improves the recall of the model. The proposed model was evaluated on the public datasets Visual Genome (VG) and VRD. In the predicate classification task on VG, Recall@50 and Recall@100 are 1.84 and 1.14 percentage points higher, respectively, than those of the existing model PE-Net (Prototype-based Embedding Network). On VRD, Recall@20, Recall@50 and Recall@100 are 0.22, 0.32 and 0.31 percentage points higher, respectively, than those of Coacher.

Key words: Visual Relationship Detection (VRD), entity pair ranking, combinatorial explosion, co-occurrence frequency, knowledge guidance
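
The two stages described in the abstract (building co-occurrence knowledge, then using it to re-score and prune entity pairs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_cooccurrence` and `rerank_pairs`, the multiplicative scoring rule, the smoothing constant `eps`, and the toy annotations are all hypothetical. The abstract only states that co-occurrence frequencies are used to raise or lower pair scores before low-scoring pairs are deleted.

```python
from collections import Counter
from itertools import permutations


def build_cooccurrence(triples):
    """Relative frequency of each (subject, object) label pair.

    `triples` is an iterable of (subject_label, predicate_label, object_label)
    taken from training annotations.
    """
    counts = Counter((s, o) for s, _, o in triples)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def rerank_pairs(detections, pair_freq, keep, eps=1e-3):
    """Score every ordered pair of detections and keep the `keep` best.

    `detections` is a list of (label, confidence). The pair score multiplies
    both detector confidences by the smoothed co-occurrence frequency, so
    rarely co-occurring pairs sink and frequent ones rise (assumed rule).
    """
    scored = []
    for (i, (ls, cs)), (j, (lo, co)) in permutations(enumerate(detections), 2):
        prior = pair_freq.get((ls, lo), 0.0) + eps
        scored.append(((i, j), cs * co * prior))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:keep]


# Toy training annotations (hypothetical).
train = [
    ("person", "riding", "horse"),
    ("person", "riding", "horse"),
    ("person", "wearing", "hat"),
    ("person", "wearing", "hat"),
    ("hat", "on", "person"),
]
freq = build_cooccurrence(train)

# Three detections give 3 * 2 = 6 ordered candidate pairs; keep only 2.
dets = [("person", 0.9), ("horse", 0.8), ("hat", 0.7)]
top = rerank_pairs(dets, freq, keep=2)
top_labels = [(dets[i][0], dets[j][0]) for (i, j), _ in top]
print(top_labels)
```

With n detected objects there are n(n-1) ordered candidate pairs, so pruning before relationship classification keeps the candidate set small. A similar prior over predicates given a (subject, object) pair could be applied to the relationship scores; only the pair stage is shown here.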

CLC Number: TP391.7

Yuanlong WANG, Wenbo HU, Hu ZHANG. Knowledge-guided visual relationship detection model[J]. Journal of Computer Applications, 2024, 44(3): 683-689.


Recall (%) on the VG dataset
Model	Predicate classification recall			Phrase detection recall			Relationship detection recall
	R@20	R@50	R@100	R@20	R@50	R@100	R@20	R@50	R@100
RLM	—	67.93	68.20	—	26.60	33.92	—	16.96	21.17
ViP	—	—	—	—	16.58	21.54	—	10.67	13.81
Motifs	58.46	65.18	67.01	35.63	38.92	39.77	25.48	32.78	37.16
VCTree	59.02	65.42	67.18	42.77	46.67	47.64	24.53	31.93	36.21
Transformer	59.06	65.55	67.29	36.87	40.18	41.02	25.55	33.04	37.40
Coacher	58.91	65.90	67.86	36.48	40.31	41.14	26.33	33.18	38.01
RU-Net	61.60	67.70	69.60	37.20	39.80	40.90	22.90	31.30	34.80
NMP	—	67.03	67.29	—	—	—	—	—	—
PE-Net	—	64.90	67.20	—	39.40	40.70	—	30.70	35.20
Proposed model	59.73	66.74	68.34	37.39	41.20	41.84	26.15	33.20	38.10
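
For readers unfamiliar with the R@K columns, Recall@K is commonly computed per image as the fraction of ground-truth triples that appear among the K highest-scoring predictions. The sketch below assumes exact label matching and ignores the bounding-box overlap test that phrase detection and relationship detection additionally require; `recall_at_k` and the toy data are hypothetical.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triples that appear in the top-k predictions.

    `predictions` is a list of (score, triple); `ground_truth` is a set of
    (subject, predicate, object) triples for one image.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    top_k = {triple for _, triple in ranked[:k]}
    return len(top_k & ground_truth) / len(ground_truth)


gt = {("person", "riding", "horse"), ("person", "wearing", "hat")}
preds = [
    (0.9, ("person", "riding", "horse")),
    (0.6, ("horse", "wearing", "hat")),
    (0.5, ("person", "wearing", "hat")),
]
r_at_2 = recall_at_k(preds, gt, k=2)  # only one of two triples is in the top 2
r_at_3 = recall_at_k(preds, gt, k=3)  # both triples are in the top 3
```

Dataset-level figures such as those in the table would then typically be averages of this per-image value, expressed as percentages.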

Predicate classification recall (%) on the VRD dataset
Model	R@20	R@50	R@100
RLM	—	—	52.19
Motifs	47.70	51.84	52.28
VCTree	48.19	52.23	52.71
Transformer	42.30	46.74	47.76
Coacher	48.09	52.08	52.79
NMP	—	52.69	52.69
Proposed model	48.31	52.40	53.10
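
The improvements quoted in the abstract can be checked against the predicate classification figures in the two results tables. The snippet below is only a worked subtraction; the values are copied from the tables, and the helper `gains` is a hypothetical name.

```python
# Predicate classification recall (%) copied from the VG and VRD tables.
vg_predcls = {
    "PE-Net": {"R@50": 64.90, "R@100": 67.20},
    "Proposed": {"R@50": 66.74, "R@100": 68.34},
}
vrd_predcls = {
    "Coacher": {"R@20": 48.09, "R@50": 52.08, "R@100": 52.79},
    "Proposed": {"R@20": 48.31, "R@50": 52.40, "R@100": 53.10},
}


def gains(table, baseline, model="Proposed"):
    """Percentage-point difference between `model` and `baseline` per metric."""
    return {
        metric: round(table[model][metric] - table[baseline][metric], 2)
        for metric in table[baseline]
    }


vg_gain = gains(vg_predcls, "PE-Net")     # R@50: 1.84, R@100: 1.14
vrd_gain = gains(vrd_predcls, "Coacher")  # R@20: 0.22, R@50: 0.32, R@100: 0.31
```

These differences match the percentage-point gains stated in the abstract.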

Model	zR@20	zR@50	zR@100
Motifs	13.05	19.03	21.98
VCTree	10.35	13.63	15.64
Transformer	11.04	13.27	15.51
Coacher	13.42	19.31	22.22
Proposed model	14.26	20.59	22.02
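
The zR@K columns report zero-shot recall. In the scene graph literature this is usually Recall@K restricted to ground-truth triples whose (subject, predicate, object) combination never occurs in the training set. The page does not define the metric, so the sketch below is an assumption about its usual form; `zero_shot_recall_at_k` and the toy data are hypothetical.

```python
def zero_shot_recall_at_k(predictions, ground_truth, train_triples, k):
    """Recall@k restricted to ground-truth triples unseen during training.

    Returns None when the image contains no zero-shot triples, so that such
    images can be skipped when averaging.
    """
    unseen = ground_truth - train_triples
    if not unseen:
        return None
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    top_k = {triple for _, triple in ranked[:k]}
    return len(top_k & unseen) / len(unseen)


train_triples = {("person", "riding", "horse")}
gt = {("person", "riding", "horse"), ("dog", "riding", "skateboard")}
preds = [
    (0.8, ("person", "riding", "horse")),
    (0.3, ("dog", "riding", "skateboard")),
]
zr_at_1 = zero_shot_recall_at_k(preds, gt, train_triples, k=1)
zr_at_2 = zero_shot_recall_at_k(preds, gt, train_triples, k=2)
```

In this toy case the unseen triple is ranked second, so it is missed at k=1 and recovered at k=2.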

Model	R@20	R@50	R@100	zR@20	zR@50	zR@100
BM	57.91	64.90	66.86	13.42	19.31	22.22
BM+P	58.56	65.10	67.56	13.07	18.91	21.97
BM+R	57.87	64.95	66.96	13.58	19.80	22.48
BM+P+R	59.73	65.74	67.34	14.26	20.59	22.02
