Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (3): 739-745. DOI: 10.11772/j.issn.1001-9081.2024050660

• Frontier research and typical applications of large models •

Visual question answering model based on association and fusion of multiple semantic features

Hao ZHOU1, Chao WANG1, Guoheng CUI1, Tingjin LUO2

  1. Department of Operational Research and Planning, Naval University of Engineering, Wuhan, Hubei 430033, China
    2. College of Science, National University of Defense Technology, Changsha, Hunan 410073, China
  • Received: 2024-05-22 Revised: 2024-06-23 Accepted: 2024-06-28 Online: 2024-07-25 Published: 2025-03-10
  • Contact: Tingjin LUO
  • About author: ZHOU Hao, born in 1993 in Changsha, Hunan, Ph.D., lecturer, CCF member. His research interests include image understanding, scene graph generation, and imbalanced learning.
    WANG Chao, born in 1995 in Wanzhou, Chongqing, M.S., teaching assistant. His research interests include text detection and recognition, and machine learning.
    CUI Guoheng, born in 1981 in Wuhan, Hubei, Ph.D., associate professor. His research interests include object recognition and intelligent analysis.
  • Supported by:
    National Natural Science Foundation of China (62302516); Natural Science Foundation of Hubei Province (2022CFC049); Huxiang Young Talents Program of Hunan Province (2021RC3070)


Abstract:

Bridging the semantic gap between visual images and textual questions is key to improving the reasoning accuracy of Visual Question Answering (VQA) models. However, most existing models extract only low-level image features and apply attention mechanisms to infer answers, ignoring the important role that high-level image semantics, such as relationship and attribute features, play in visual reasoning. To address this problem, a VQA model based on multi-semantic association and fusion was proposed to establish semantic associations between questions and images. Firstly, multiple semantic features were extracted from images with a scene graph generation framework and refined as the feature input of the VQA model, so as to fully exploit the information in visual scenes. Secondly, to increase the semantic value of the image features, an information filter was designed to remove noise and redundant information from them. Finally, a multi-layer attention fusion and reasoning module was designed to fuse each type of image semantics with the question features and strengthen the semantic association between important image regions and the question. Experimental results show that, compared with the Bilinear Attention Network (BAN) and Coarse-to-Fine Reasoning (CFR) models, the proposed model improves accuracy on the VQA2.0 test set by 2.9 and 0.4 percentage points respectively, and on the GQA test set by 17.2 and 0.3 percentage points respectively, demonstrating that it better understands the semantics of image scenes and answers compositional visual questions.
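To make the three-stage pipeline in the abstract concrete, the following is a minimal PyTorch sketch of how scene-graph semantics (object, attribute, and relationship features) might be filtered and then fused with a question representation via attention. Everything here is an illustrative assumption rather than the authors' actual architecture: the module names (InformationFilter, SemanticAttentionFusion, MultiSemanticVQA), the sigmoid-gate filter, the concatenation-based attention scoring, and the multiplicative fusion are generic stand-ins chosen only to show the shape of such a model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InformationFilter(nn.Module):
    """Gate-style filter suppressing noisy/redundant channels (illustrative
    assumption; the paper's filter design may differ)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_items, dim); gate values lie in [0, 1]
        return feats * self.gate(feats)

class SemanticAttentionFusion(nn.Module):
    """Question-guided attention over one semantic stream (objects,
    attributes, or relationships), followed by a simple fusion."""
    def __init__(self, dim: int):
        super().__init__()
        self.filter = InformationFilter(dim)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, feats: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        feats = self.filter(feats)                          # (B, N, D)
        q_exp = q.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = self.attn(torch.cat([feats, q_exp], -1))   # (B, N, 1)
        weights = F.softmax(scores, dim=1)                  # attend over N items
        attended = (weights * feats).sum(dim=1)             # (B, D)
        return attended * q                                 # multiplicative fusion

class MultiSemanticVQA(nn.Module):
    """Fuses the three semantic streams with the question and predicts
    an answer distribution."""
    def __init__(self, dim: int, num_answers: int):
        super().__init__()
        self.obj_fusion = SemanticAttentionFusion(dim)
        self.attr_fusion = SemanticAttentionFusion(dim)
        self.rel_fusion = SemanticAttentionFusion(dim)
        self.classifier = nn.Linear(3 * dim, num_answers)

    def forward(self, obj, attr, rel, q):
        fused = torch.cat([self.obj_fusion(obj, q),
                           self.attr_fusion(attr, q),
                           self.rel_fusion(rel, q)], dim=-1)
        return self.classifier(fused)                       # answer logits

# Usage with dummy tensors (dimensions are arbitrary placeholders):
# model = MultiSemanticVQA(dim=512, num_answers=3129)
# obj, attr = torch.randn(2, 36, 512), torch.randn(2, 36, 512)
# rel, q = torch.randn(2, 64, 512), torch.randn(2, 512)
# logits = model(obj, attr, rel, q)                         # (2, 3129)

Treating objects, attributes, and relationships as separate attention streams, as sketched above, is one straightforward way to realize the "multi-layer attention fusion" the abstract describes; the published model may weight, stack, or combine the streams differently.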

Key words: fusion of multiple semantic features, Visual Question Answering (VQA), scene graph, attribute attention, relationship attention


CLC Number: