Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (10): 3018-3024. DOI: 10.11772/j.issn.1001-9081.2021091622

• Artificial intelligence •

Cross-modal tensor fusion network based on semantic relation graph for image-text retrieval

Changhong LIU1, Sheng ZENG1, Bin ZHANG1, Yong CHEN2   

  1. School of Computer and Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 330022, China
    2. School of Business Administration, Nanchang Institute of Technology, Nanchang, Jiangxi 330029, China
  • Received: 2021-09-14 Revised: 2021-12-20 Accepted: 2021-12-30 Online: 2022-10-14 Published: 2022-10-10
  • Contact: Changhong LIU, liuch@jxnu.edu.cn
  • About author: LIU Changhong, born in 1977, Ph.D., associate professor, CCF member. Her research interests include computer vision, cross-modal information retrieval, and hyperspectral image processing.
    ZENG Sheng, born in 1996, M.S. candidate. His research interests include cross-modal information retrieval and computer vision.
    ZHANG Bin, born in 1997, M.S. candidate. His research interests include cross-modal generation and computer vision.
    CHEN Yong, born in 1973, M.S., lecturer. His research interests include e-commerce and image processing.
  • Supported by:
    National Natural Science Foundation of China(62067004)


Abstract:

Effectively capturing the semantic correlation between images and text is the key to cross-modal image-text retrieval. Most existing methods learn either the global semantic correlation between image region features and text features, or the local semantic correlation between objects across modalities, while ignoring the correlation between intra-modality object relationships and inter-modality object relationships. To address this problem, a Cross-Modal Tensor Fusion Network based on Semantic Relation Graph (CMTFN-SRG) was proposed for image-text retrieval. Firstly, the relationships among image regions were learned by a Graph Convolutional Network (GCN), and the relationships among text words were constructed by a Bidirectional Gated Recurrent Unit (Bi-GRU). Then, the learned semantic relation graphs of image regions and text words were matched through a tensor fusion network to learn the fine-grained semantic correlation between the data of the two modalities. At the same time, a Gated Recurrent Unit (GRU) was used to learn the global features of the image, and the global features of the image and the text were matched to capture the inter-modality global semantic correlation. The proposed method was compared with the Multi-Modality Cross Attention (MMCA) method on the benchmark datasets Flickr30K and MS-COCO. Experimental results show that the proposed method improves the Recall@1 of the text-to-image retrieval task by 2.6%, 9.0% and 4.1% on the Flickr30K, MS-COCO 1K and MS-COCO 5K test sets respectively, and improves the mean Recall (mR) by 0.4, 1.3 and 0.1 percentage points respectively, indicating that the proposed method can effectively improve the precision of image-text retrieval.
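To make the two matching branches described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch: a GCN-style update over detected region features paired with a Bi-GRU over word embeddings for relation-level (fine-grained) matching via low-rank bilinear (tensor-fusion style) interaction, plus a GRU-based global branch. All module names, dimensions, adjacency construction, and the score aggregation are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CMTFNSRGSketch(nn.Module):
    """Hypothetical sketch of the CMTFN-SRG matching branches (not the authors' code).
    Assumed inputs: detector region features (e.g., R x 2048 per image) and
    word embeddings per sentence; all dimensions are illustrative."""
    def __init__(self, region_dim=2048, word_dim=300, hidden=1024, rank=256):
        super().__init__()
        # Relation branch: one GCN layer over image regions (adjacency from
        # pairwise region affinities) and a Bi-GRU over words.
        self.region_proj = nn.Linear(region_dim, hidden)
        self.gcn_weight = nn.Linear(hidden, hidden)
        self.word_bigru = nn.GRU(word_dim, hidden // 2, bidirectional=True,
                                 batch_first=True)
        # Low-rank factors for bilinear (tensor-fusion style) matching.
        self.img_factor = nn.Linear(hidden, rank, bias=False)
        self.txt_factor = nn.Linear(hidden, rank, bias=False)
        # Global branch: GRU summary over the region sequence.
        self.img_gru = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, regions, words):
        # regions: (B, R, region_dim); words: (B, T, word_dim)
        v = self.region_proj(regions)                       # (B, R, hidden)
        adj = F.softmax(torch.bmm(v, v.transpose(1, 2)), dim=-1)
        v_rel = F.relu(self.gcn_weight(torch.bmm(adj, v)))  # GCN-style update
        t_rel, _ = self.word_bigru(words)                   # (B, T, hidden)

        # Fine-grained matching: bilinear similarity between every region
        # node and every word node, then aggregated to one score per pair.
        sim = torch.einsum('brk,btk->brt',
                           self.img_factor(v_rel), self.txt_factor(t_rel))
        local_score = sim.max(dim=1).values.mean(dim=1)     # (B,)

        # Global matching: GRU image summary vs. mean-pooled text features.
        _, g_img = self.img_gru(v)                          # (1, B, hidden)
        g_txt = t_rel.mean(dim=1)
        global_score = F.cosine_similarity(g_img.squeeze(0), g_txt, dim=-1)
        return local_score + global_score                   # fused similarity

In a full training setup, a similarity of this kind would typically be optimized with a bidirectional triplet ranking loss over matched and mismatched image-text pairs, as is common in this line of work.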

Key words: cross-modal retrieval, tensor fusion network, Graph Convolutional Network (GCN), semantic correlation, semantic relation graph
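As a note on the reported figures: Recall@K is the fraction of queries whose ground-truth match appears among the top K retrieved results, and mean Recall (mR) averages Recall@{1, 5, 10} over both retrieval directions (image-to-text and text-to-image). The helper below illustrates this protocol in a simplified one-ground-truth-per-query setting; variable names are assumptions, and on benchmarks such as Flickr30K and MS-COCO each image actually has five captions, any of which counts as correct.

import numpy as np

def recall_at_k(sim, gt_index, k):
    """Fraction of queries whose ground-truth item ranks in the top k.
    sim: (num_queries, num_gallery) similarity matrix; gt_index[i] is the
    gallery index of query i's match (simplified 1:1 setting)."""
    ranks = np.argsort(-sim, axis=1)                 # best match first
    topk = ranks[:, :k]
    return float(np.mean([gt_index[i] in topk[i] for i in range(len(sim))]))

def mean_recall(sim_i2t, sim_t2i, gt_i2t, gt_t2i):
    """mR: average of Recall@{1,5,10} over both retrieval directions."""
    scores = [recall_at_k(s, g, k)
              for s, g in ((sim_i2t, gt_i2t), (sim_t2i, gt_t2i))
              for k in (1, 5, 10)]
    return sum(scores) / len(scores)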

