[1] ANTOL S, AGRAWAL A, LU J, et al. VQA: visual question answering [C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 2425-2433.
[2] YU J, ZHANG W, LU Y, et al. Reasoning on the relation: enhancing visual representation for visual question answering and cross-modal retrieval [J]. IEEE Transactions on Multimedia, 2020, 22(12): 3196-3209.
[3] LU S, LIU M, YIN L, et al. The multi-modal fusion in visual question answering: a review of attention mechanisms [J]. PeerJ Computer Science, 2023, 9: No.e1400.
[4] LI X, FAN Z G, LI X X, et al. Survey of visual question answering based on deep learning [J]. Computer Science, 2023, 50(5): 177-188. (in Chinese)
[5] MALINOWSKI M, ROHRBACH M, FRITZ M. Ask your neurons: a neural-based approach to answering questions about images [C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 1-9.
[6] KIM J H, LEE S W, KWAK D, et al. Multimodal residual learning for visual QA [C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2016: 361-369.
[7] KIM J H, JUN J, ZHANG B T. Bilinear attention networks [C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2018: 1571-1581.
[8] TENEY D, LIU L, VAN DEN HENGEL A. Graph-structured representations for visual question answering [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 3233-3241.
[9] YANG Z, HE X, GAO J, et al. Stacked attention networks for image question answering [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 21-29.
[10] RAHMAN T, CHOU S H, SIGAL L, et al. An improved attention for visual question answering [C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 1653-1662.
[11] ZENG Y, ZHANG X, LI H. Multi-grained vision language pre-training: aligning texts with visual concepts [C]// Proceedings of the 39th International Conference on Machine Learning. New York: JMLR.org, 2022: 25994-26009.
[12] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086.
[13] JIANG H, MISRA I, ROHRBACH M, et al. In defense of grid features for visual question answering [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10264-10273.
[14] NGUYEN B X, DO T, TRAN H, et al. Coarse-to-fine reasoning for visual question answering [C]// Proceedings of the 2022 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Piscataway: IEEE, 2022: 4558-4566.
[15] ZHOU H, ZHANG J, LUO T, et al. Debiased scene graph generation for dual imbalance learning [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(4): 4274-4288.
[16] ZHOU H, YANG Y, LUO T, et al. A unified deep sparse graph attention network for scene graph generation [J]. Pattern Recognition, 2022, 123: No.108367.
[17] PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation [C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014: 1532-1543.
[18] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional Transformers for language understanding [C]// Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg: ACL, 2019: 4171-4186.
[19] CHEN Y C, LI L, YU L, et al. UNITER: universal image-text representation learning [C]// Proceedings of the 2020 European Conference on Computer Vision. Cham: Springer, 2020: 104-120.
[20] YANG Z, QIN Z, YU J, et al. Scene graph reasoning with prior visual relationship for visual question answering [C]// Proceedings of the 2020 IEEE International Conference on Image Processing. Piscataway: IEEE, 2020: 1411-1415.
[21] YU Z, YU J, CUI Y, et al. Deep modular co-attention networks for visual question answering [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6274-6283.
[22] XIONG P, SHEN Y, JIN H. MGA-VQA: multi-granularity alignment for visual question answering [EB/OL]. [2024-02-12].
[23] JING C, JIA Y, WU Y, et al. Maintaining reasoning consistency in compositional visual question answering [C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 5089-5098.
[24] ANDREAS J, ROHRBACH M, DARRELL T, et al. Neural module networks [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 39-48.
[25] CHEN W, GAN Z, LI L, et al. Meta module network for compositional visual reasoning [C]// Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision. Piscataway: IEEE, 2021: 655-664.
[26] HU R, ROHRBACH A, DARRELL T, et al. Language-conditioned graph networks for relational reasoning [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 10293-10302.
[27] JING C, JIA Y, WU Y, et al. Learning the dynamics of visual relational reasoning via reinforced path routing [C]// Proceedings of the 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 1122-1130.
[28] ZHOU H. Scene graph generation for image semantic understanding and representation [D]. Changsha: National University of Defense Technology, 2021: 133-136. (in Chinese)
[29] TANG K, ZHANG H, WU B, et al. Learning to compose dynamic tree structures for visual contexts [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6612-6621.
[30] HUDSON D A, MANNING C D. GQA: a new dataset for real-world visual reasoning and compositional question answering [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6693-6702.
[31] GAO P, JIANG Z, YOU H, et al. Dynamic fusion with intra- and inter-modality attention flow for visual question answering [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6632-6641.
[32] SHA F, CHAO W L, HU H. Learning answer embeddings for visual question answering [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 5428-5436.
[33] DO T, TRAN H, DO T T, et al. Compact trilinear interaction for visual question answering [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 392-401.
[34] TAN H, BANSAL M. LXMERT: learning cross-modality encoder representations from Transformers [C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg: ACL, 2019: 5100-5111.
[35] KIM W, SON B, KIM I. ViLT: vision-and-language Transformer without convolution or region supervision [C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 5583-5594.
[36] LI X, YIN X, LI C, et al. OSCAR: object-semantics aligned pre-training for vision-language tasks [C]// Proceedings of the 2020 European Conference on Computer Vision, LNCS 12375. Cham: Springer, 2020: 121-137.