Knowledge-guided visual relationship detection model

doi:10.11772/j.issn.1001-9081.2023040413

Abstract


Visual Relationship Detection (VRD) builds on object recognition by further detecting the relationships between the recognized objects, and is a key technology for visual understanding and reasoning. Because any two objects may interact, pairing them exhaustively causes a combinatorial explosion of candidate relationships and produces many weakly correlated entity pairs, which in turn lowers the recall of the subsequent relationship detection. To address this problem, a knowledge-guided visual relationship detection model is proposed. First, visual knowledge is constructed: the entity labels and relationship labels in common visual relationship detection datasets are analyzed statistically, and the co-occurrence frequencies of interactions between entities and relationships are taken as the visual knowledge. Then, this knowledge is used to optimize the entity-pair combination process: the scores of weakly correlated entity pairs are lowered and those of strongly correlated pairs are raised, the pairs are ranked by score, and the low-scoring pairs are discarded. The relationship scores between entities are optimized in the same knowledge-guided way, which improves the recall of the model. The proposed model was evaluated on the public datasets VG (Visual Genome) and VRD. On the predicate classification task, compared with the existing model PE-Net (Prototype-based Embedding Network), Recall@50 and Recall@100 improved by 1.84 and 1.14 percentage points, respectively, on the VG dataset; compared with Coacher, Recall@20, Recall@50 and Recall@100 improved by 0.22, 0.32 and 0.31 percentage points, respectively, on the VRD dataset.
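
The abstract does not give the scoring formulas, so the following Python is only a minimal sketch of the general idea under stated assumptions: co-occurrence counts are collected from annotated (subject, predicate, object) triples, entity pairs are scored as the product of the two detector confidences and a smoothed co-occurrence prior, and predicate scores are reweighted by smoothed triple counts. The function names, the multiplicative scoring rule, and the `smoothing` and `keep` parameters are all hypothetical choices for illustration, not the authors' method.

```python
from collections import Counter
from itertools import permutations


def build_cooccurrence(triples):
    """Count (subject, object) pairs and full triples in the annotations."""
    pair_counts = Counter()
    triple_counts = Counter()
    for subj, pred, obj in triples:
        pair_counts[(subj, obj)] += 1
        triple_counts[(subj, pred, obj)] += 1
    return pair_counts, triple_counts


def rank_entity_pairs(detections, pair_counts, keep=3, smoothing=1.0):
    """Score every ordered pair of detections and keep the `keep` best.

    detections: list of (label, detector_confidence).
    Assumed rule: confidence product times a smoothed co-occurrence prior,
    so rarely co-occurring pairs sink and frequent ones rise.
    """
    max_count = max(pair_counts.values(), default=0)
    scored = []
    for (s_label, s_conf), (o_label, o_conf) in permutations(detections, 2):
        prior = (pair_counts[(s_label, o_label)] + smoothing) / (max_count + smoothing)
        scored.append((s_conf * o_conf * prior, s_label, o_label))
    scored.sort(reverse=True)
    return scored[:keep]  # low-scoring pairs are discarded


def rescore_predicates(pred_scores, triple_counts, subj, obj, smoothing=1.0):
    """Reweight positive predicate scores for one pair by how often each
    predicate was seen with that (subject, object) pair, then renormalize."""
    weighted = {
        pred: score * (triple_counts[(subj, pred, obj)] + smoothing)
        for pred, score in pred_scores.items()
    }
    norm = sum(weighted.values())
    return {pred: w / norm for pred, w in weighted.items()}


if __name__ == "__main__":
    annotations = [
        ("person", "ride", "horse"),
        ("person", "ride", "horse"),
        ("person", "wear", "hat"),
        ("horse", "on", "grass"),
    ]
    pairs, triples = build_cooccurrence(annotations)
    detections = [("person", 0.9), ("horse", 0.8), ("hat", 0.7)]
    print(rank_entity_pairs(detections, pairs, keep=2))
    print(rescore_predicates({"ride": 0.4, "wear": 0.6}, triples, "person", "horse"))
```

With the toy annotations above, the pair (person, horse) outranks (horse, person) even though both have the same detector confidences, because only the former was observed in the annotations. The additive smoothing keeps unseen pairs and triples from being scored as exactly zero.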

Key words: Visual Relationship Detection (VRD), entity pair ranking, combinatorial explosion, co-occurrence frequency, knowledge guidance


CLC Number: TP391.7

Yuanlong WANG, Wenbo HU, Hu ZHANG. Knowledge-guided visual relationship detection model[J]. Journal of Computer Applications, 2024, 44(3): 683-689.




Model	Predicate classification recall (%)			Phrase detection recall (%)			Relationship detection recall (%)
	R@20	R@50	R@100	R@20	R@50	R@100	R@20	R@50	R@100
RLM	—	67.93	68.20	—	26.60	33.92	—	16.96	21.17
ViP	—	—	—	—	16.58	21.54	—	10.67	13.81
Motifs	58.46	65.18	67.01	35.63	38.92	39.77	25.48	32.78	37.16
VCTree	59.02	65.42	67.18	42.77	46.67	47.64	24.53	31.93	36.21
Transformer	59.06	65.55	67.29	36.87	40.18	41.02	25.55	33.04	37.40
Coacher	58.91	65.90	67.86	36.48	40.31	41.14	26.33	33.18	38.01
RU-Net	61.60	67.70	69.60	37.20	39.80	40.90	22.90	31.30	34.80
NMP	—	67.03	67.29	—	—	—	—	—	—
PE-Net	—	64.90	67.20	—	39.40	40.70	—	30.70	35.20
Proposed model	59.73	66.74	68.34	37.39	41.20	41.84	26.15	33.20	38.10
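
The R@20, R@50 and R@100 columns report Recall@K. The page does not spell out the evaluation protocol, so the sketch below uses the commonly cited definition: the fraction of ground-truth triples that appear among the K highest-scoring predicted triples. As a simplifying assumption, triples are matched by label only and bounding-box overlap is ignored; `recall_at_k` and its argument layout are hypothetical.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triples found among the k highest-scoring
    predicted triples of one image.

    predictions: list of (score, (subject, predicate, object)).
    ground_truth: collection of (subject, predicate, object).
    """
    truth = set(ground_truth)
    if not truth:
        return 0.0
    top_k = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    hits = {triple for _, triple in top_k} & truth
    return len(hits) / len(truth)


if __name__ == "__main__":
    predictions = [
        (0.9, ("person", "ride", "horse")),
        (0.8, ("person", "wear", "hat")),
        (0.1, ("horse", "on", "grass")),
    ]
    ground_truth = {("person", "ride", "horse"), ("horse", "on", "grass")}
    for k in (1, 2, 3):
        print(k, recall_at_k(predictions, ground_truth, k))
```

Under this definition a larger K can only keep or raise the recall, which is consistent with every row of the table growing from R@20 to R@100. It also shows why pruning weakly correlated entity pairs matters: each spurious candidate that ranks high takes one of the K slots.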


Model	R@20	R@50	R@100
RLM	—	—	52.19
Motifs	47.70	51.84	52.28
VCTree	48.19	52.23	52.71
Transformer	42.30	46.74	47.76
Coacher	48.09	52.08	52.79
NMP	—	52.69	52.69
Proposed model	48.31	52.40	53.10
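
The tables on this page carry no captions, but the gains quoted in the abstract can be recomputed from them. The short check below copies the relevant rows by hand (the dictionary names and the `gains` helper are illustrative only) and subtracts the baseline from the proposed model.

```python
# Recall values (%) copied from the tables above.
vg_predcls = {  # predicate classification columns of the nine-column table
    "PE-Net": {"R@50": 64.90, "R@100": 67.20},
    "proposed": {"R@50": 66.74, "R@100": 68.34},
}
vrd_predcls = {  # three-column R@K table
    "Coacher": {"R@20": 48.09, "R@50": 52.08, "R@100": 52.79},
    "proposed": {"R@20": 48.31, "R@50": 52.40, "R@100": 53.10},
}


def gains(table, baseline, model="proposed"):
    """Percentage-point difference (model minus baseline) per metric."""
    return {
        metric: round(table[model][metric] - value, 2)
        for metric, value in table[baseline].items()
    }


print(gains(vg_predcls, "PE-Net"))
print(gains(vrd_predcls, "Coacher"))
```

The differences come out as 1.84 and 1.14 points against PE-Net and 0.22, 0.32 and 0.31 points against Coacher, matching the abstract. That suggests the nine-column table reports the VG results and the three-column table the VRD predicate classification results, although the page does not say so explicitly.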


Model	zR@20	zR@50	zR@100
Motifs	13.05	19.03	21.98
VCTree	10.35	13.63	15.64
Transformer	11.04	13.27	15.51
Coacher	13.42	19.31	22.22
Proposed model	14.26	20.59	22.02
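
The page does not define zR@K. In the scene graph literature it usually denotes zero-shot recall, that is, Recall@K computed only over test triples whose (subject, predicate, object) combination never occurs in the training set. Assuming that reading, the evaluation subset could be selected as in this sketch; `zero_shot_triples` is a hypothetical helper.

```python
def zero_shot_triples(train_triples, test_triples):
    """Return the test triples whose (subject, predicate, object) label
    combination never occurs in the training annotations."""
    seen = set(train_triples)
    return [t for t in test_triples if t not in seen]


if __name__ == "__main__":
    train = [("person", "ride", "horse"), ("person", "wear", "hat")]
    test = [("person", "ride", "horse"), ("person", "ride", "elephant")]
    print(zero_shot_triples(train, test))
```

Only the unseen combination survives the filter, and recall is then measured against that subset alone.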