Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (3): 739-745. DOI: 10.11772/j.issn.1001-9081.2024050660

• Frontier research and typical applications of large models •

Visual question answering model based on association and fusion of multiple semantic features

Hao ZHOU1, Chao WANG1, Guoheng CUI1, Tingjin LUO2

  1. Department of Operational Research and Planning, Naval University of Engineering, Wuhan, Hubei 430033, China
    2. College of Science, National University of Defense Technology, Changsha, Hunan 410073, China
  • Received: 2024-05-22 Revised: 2024-06-23 Accepted: 2024-06-28 Online: 2024-07-25 Published: 2025-03-10
  • Contact: Tingjin LUO
  • About author: ZHOU Hao, born in 1993 in Changsha, Hunan, Ph.D., lecturer, CCF member. His research interests include image understanding, scene graph generation, and imbalanced learning.
    WANG Chao, born in 1995 in Wanzhou, Chongqing, M.S., teaching assistant. His research interests include text detection and recognition, and machine learning.
    CUI Guoheng, born in 1981 in Wuhan, Hubei, Ph.D., associate professor. His research interests include object recognition and intelligent analysis.
  • Supported by:
    National Natural Science Foundation of China (62302516); Natural Science Foundation of Hubei Province (2022CFC049); Huxiang Young Talents Program of Hunan Province (2021RC3070)


Abstract:

Bridging the semantic gap between visual images and textual questions is key to improving the reasoning accuracy of Visual Question Answering (VQA) models. However, most existing models extract only low-level image features and apply attention mechanisms to infer answers, ignoring the important role that high-level image semantics, such as relationship and attribute features, play in visual reasoning. To address this problem, a VQA model based on multi-semantic association and fusion was proposed to establish semantic associations between questions and images. Firstly, multiple semantic features were extracted from images with a scene graph generation framework and refined as the feature input of the VQA model, so as to fully exploit the information in visual scenes. Secondly, to increase the semantic value of the image features, an information filter was designed to remove noise and redundant information from them. Finally, a multi-layer attention fusion and reasoning module was designed to fuse each type of image semantics with the question features and strengthen the semantic association between important image regions and the question. Experimental results show that, compared with the Bilinear Attention Network (BAN) and Coarse-to-Fine Reasoning (CFR) models, the proposed model improves accuracy on the VQA2.0 test set by 2.9 and 0.4 percentage points respectively, and on the GQA test set by 17.2 and 0.3 percentage points respectively, demonstrating that it better understands the semantics of image scenes and answers compositional visual questions.
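To make the three-stage pipeline in the abstract concrete, the following is a minimal PyTorch sketch of how scene-graph semantics (object, attribute, and relationship features) might be filtered and then fused with a question representation via attention. Everything here is an illustrative assumption rather than the authors' actual architecture: the module names (InformationFilter, SemanticAttentionFusion, MultiSemanticVQA), the sigmoid-gate filter, the concatenation-based attention scoring, and the multiplicative fusion are generic stand-ins chosen only to show the shape of such a model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InformationFilter(nn.Module):
    """Gate-style filter suppressing noisy/redundant channels (illustrative
    assumption; the paper's filter design may differ)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_items, dim); gate values lie in [0, 1]
        return feats * self.gate(feats)

class SemanticAttentionFusion(nn.Module):
    """Question-guided attention over one semantic stream (objects,
    attributes, or relationships), followed by a simple fusion."""
    def __init__(self, dim: int):
        super().__init__()
        self.filter = InformationFilter(dim)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, feats: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        feats = self.filter(feats)                          # (B, N, D)
        q_exp = q.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = self.attn(torch.cat([feats, q_exp], -1))   # (B, N, 1)
        weights = F.softmax(scores, dim=1)                  # attend over N items
        attended = (weights * feats).sum(dim=1)             # (B, D)
        return attended * q                                 # multiplicative fusion

class MultiSemanticVQA(nn.Module):
    """Fuses the three semantic streams with the question and predicts
    an answer distribution."""
    def __init__(self, dim: int, num_answers: int):
        super().__init__()
        self.obj_fusion = SemanticAttentionFusion(dim)
        self.attr_fusion = SemanticAttentionFusion(dim)
        self.rel_fusion = SemanticAttentionFusion(dim)
        self.classifier = nn.Linear(3 * dim, num_answers)

    def forward(self, obj, attr, rel, q):
        fused = torch.cat([self.obj_fusion(obj, q),
                           self.attr_fusion(attr, q),
                           self.rel_fusion(rel, q)], dim=-1)
        return self.classifier(fused)                       # answer logits

# Usage with dummy tensors (dimensions are arbitrary placeholders):
# model = MultiSemanticVQA(dim=512, num_answers=3129)
# obj, attr = torch.randn(2, 36, 512), torch.randn(2, 36, 512)
# rel, q = torch.randn(2, 64, 512), torch.randn(2, 512)
# logits = model(obj, attr, rel, q)                         # (2, 3129)

Treating objects, attributes, and relationships as separate attention streams, as sketched above, is one straightforward way to realize the "multi-layer attention fusion" the abstract describes; the published model may weight, stack, or combine the streams differently.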

Key words: fusion of multiple semantic features, Visual Question Answering (VQA), scene graph, attribute attention, relationship attention


CLC Number: