Journal of Computer Applications (《计算机应用》), sole official website


Cross-modal tensor fusion for image-text retrieval based on semantic relation graphs

LIU Changhong (刘长红)1, ZENG Sheng (曾胜)1, ZHANG Bin (张斌)1, CHEN Yong (陈勇)2

  1. Jiangxi Normal University
  2. Nanchang Institute of Technology
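  • Received: 2021-09-14 Revised: 2021-12-20 Online: 2022-04-15 Published: 2022-04-15
  • Corresponding author: LIU Changhong
  • Supported by:
    Research on deep generative models of speech-driven teaching action sequences for virtual teachers; Research on the consistent representation of multimodal information on teachers' classroom teaching style and its applications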

Cross-Modal Tensor Fusion Network based on Semantic Relation Graph for Image-Text Retrieval

  • Received:2021-09-14 Revised:2021-12-20 Online:2022-04-15 Published:2022-04-15
  • Contact: LIU Changhong

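Abstract: The difficulty in cross-modal image-text retrieval lies in effectively learning the semantic correlation between images and text. Most existing methods learn either the global semantic correlation between image region features and text features or the local semantic correlation between objects, and ignore the association between the object relationships within each modality and the object relationships across modalities. To address this problem, an image-text retrieval method built on a cross-modal tensor fusion network with semantic relation graphs is proposed. A graph convolutional network learns the relationships among image regions, and a bidirectional GRU builds the relationships among text words. The learned semantic relation graphs of image regions and text words are then matched through a tensor fusion network to learn the fine-grained associations between the two modalities. In addition, a GRU learns the global image feature, and the global features of the image and the text are matched to capture the global semantic correlation between modalities. The proposed method is compared experimentally with recent related algorithms on two benchmark datasets, Flickr30K and MS-COCO, and the results show that it effectively improves image-text retrieval accuracy.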

Keywords: cross-modal retrieval, tensor fusion network, graph convolutional network, semantic correlation, semantic relation graph

Abstract: The key challenge in cross-modal image-text retrieval is how to capture the semantic correlation between images and texts effectively. Most existing methods learn either the global semantic correlation between image region features and text features or the local semantic correlations between inter-modality objects, ignoring the intra-modality relationships and the inter-modality correlation between the relationships of image regions and those of sentence words. To solve this problem, a cross-modal tensor fusion network based on semantic relation graph (CMTFN-SRG) is proposed for image-text retrieval. The intra-modality relationships of image regions and sentence words are generated by a Graph Convolutional Network (GCN) and a Bidirectional Gated Recurrent Unit (Bi-GRU), respectively, and a tensor fusion network then matches them to explore the fine-grained semantic correlation between the two modalities. Moreover, the method uses a GRU to learn the global features of the image and matches the global features of the image and the text to capture the inter-modality global semantic correlation. The proposed method is compared with recent related algorithms on two benchmark datasets, Flickr30K and MS-COCO. The experimental results show that the proposed method effectively improves image-text retrieval accuracy.

Key words: Cross-modal retrieval, Tensor fusion network, Graph Convolutional Networks, Semantic correlation, Semantic graph
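
The pipeline described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' CMTFN-SRG implementation: the abstract does not give the exact formulas, so the softmax dot-product edges, the single graph-convolution step, the mean pooling, and the outer-product fusion (in the style of tensor fusion networks) are all assumptions. Every function name is hypothetical, and the word features stand in for Bi-GRU outputs, which are not computed here.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relation_graph(features):
    """Assumed edge definition: row-wise softmax over pairwise dot products."""
    graph = []
    for fi in features:
        scores = [sum(x * y for x, y in zip(fi, fj)) for fj in features]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        graph.append([e / total for e in exps])
    return graph

def gcn_layer(features, weight):
    """One graph-convolution step: ReLU(A @ X @ W), with A from relation_graph."""
    adj = relation_graph(features)
    out = matmul(matmul(adj, features), weight)
    return [[max(0.0, v) for v in row] for row in out]

def mean_pool(rows):
    """Average the node features into a single vector (assumed pooling)."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def tensor_fusion(u, v):
    """Flattened outer product of [u; 1] and [v; 1].

    Appending 1 keeps the unimodal terms alongside the bimodal products.
    """
    ue, ve = u + [1.0], v + [1.0]
    return [a * b for a in ue for b in ve]

# Toy inputs: 3 image regions and 2 words, each with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[0.5, 0.5], [1.0, 0.0]]   # stand-in for Bi-GRU word states
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical layer weight

region_vec = mean_pool(gcn_layer(regions, identity))
word_vec = mean_pool(words)
fused = tensor_fusion(region_vec, word_vec)
print(len(fused))
```

In a full model, the fused vector would feed a learned scoring layer that produces the image-text matching score; that layer is omitted because the abstract does not specify it.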

CLC number: