Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (12): 3776-3783. DOI: 10.11772/j.issn.1001-9081.2023121860

• Artificial intelligence •

Image-text retrieval model based on intra-modal fine-grained feature relationship extraction

Zucheng WU, Xiaojun WU, Tianyang XU

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Received: 2024-01-10 Revised: 2024-04-25 Accepted: 2024-05-07 Online: 2024-06-07 Published: 2024-12-10
  • Contact: Xiaojun WU
  • About author: WU Zucheng, born in 1998 in Suzhou, Jiangsu, M.S. candidate. His research interests include cross-modal retrieval and deep learning.
    XU Tianyang, born in 1989 in Wuxi, Jiangsu, Ph.D., associate professor. His research interests include artificial intelligence, pattern recognition, and computer vision.
  • Supported by:
    National Natural Science Foundation of China(62020106012)


Abstract:

Cross-modal retrieval involves diverse relationships between objects, and the traditional appearance-based paradigm performs poorly in complex scenes because it cannot accurately reflect the relationships between salient objects in an image. To address these problems, an image-text retrieval model based on intra-modal fine-grained feature relationship extraction was proposed. Firstly, to obtain more intuitive position information, the image was divided into grids, and position representations were built from the spatial relationships between objects and grid cells. Secondly, to keep node information stable and independent during the relationship modeling stage, a cross-modal information-guided feature fusion module was employed. Finally, an adaptive triplet loss was proposed to dynamically balance the training weights of positive and negative samples. Experimental results show that, compared with CHAN (Cross-modal Hard Aligning Network), the proposed model improves the R@sum metric (the sum of R@1, R@5, and R@10 for both image-to-text and text-to-image retrieval) by 1.5% and 0.02% on the Flickr30K and MS-COCO 1K datasets, respectively, which verifies the effectiveness of the proposed model in retrieval recall.
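As a concrete illustration of the first step, the sketch below represents each object's position by how much of every grid cell its bounding box covers. This overlap-ratio encoding, the grid size of 8, and the function name are assumptions made for illustration, not the paper's exact construction.

    import torch

    def grid_position_encoding(boxes, img_w, img_h, grid=8):
        # boxes: (n, 4) object bounding boxes as (x1, y1, x2, y2) in pixels.
        # Returns (n, grid*grid): for each object, the fraction of every grid
        # cell covered by its box. NOTE: an illustrative assumption, not
        # necessarily the paper's exact position representation.
        cell_w, cell_h = img_w / grid, img_h / grid
        enc = torch.zeros(boxes.size(0), grid * grid)
        for i, (x1, y1, x2, y2) in enumerate(boxes.tolist()):
            for r in range(grid):
                for c in range(grid):
                    # Intersection area between the box and grid cell (r, c).
                    ix = max(0.0, min(x2, (c + 1) * cell_w) - max(x1, c * cell_w))
                    iy = max(0.0, min(y2, (r + 1) * cell_h) - max(y1, r * cell_h))
                    enc[i, r * grid + c] = (ix * iy) / (cell_w * cell_h)
        return enc

Such a vector gives each detected object an explicit, grid-anchored location that a Graph Convolutional Network can consume alongside appearance features.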

Key words: cross-modal retrieval, image-text retrieval, relationship extraction, Graph Convolutional Network (GCN), triplet loss
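The adaptive triplet loss is not specified in detail in the abstract; the sketch below shows one plausible reading under stated assumptions, in which harder negatives receive larger softmax weights instead of the usual hardest-negative selection. The margin, temperature, and weighting scheme are all illustrative choices, not the paper's formulation.

    import torch
    import torch.nn.functional as F

    def adaptive_triplet_loss(img_emb, txt_emb, margin=0.2, temperature=10.0):
        # img_emb, txt_emb: (batch, dim) L2-normalized embeddings; row i of
        # each is a matched image-text pair. The softmax weighting below is
        # an illustrative assumption, not the paper's exact formulation.
        scores = img_emb @ txt_emb.t()          # (batch, batch) similarities
        pos = scores.diag().unsqueeze(1)        # similarity of matched pairs

        # Margin violations in both retrieval directions.
        cost_i2t = (margin + scores - pos).clamp(min=0)      # image -> text
        cost_t2i = (margin + scores - pos.t()).clamp(min=0)  # text -> image

        # Exclude the positive pairs on the diagonal from the weighting.
        eye = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        w_i2t = F.softmax(cost_i2t.masked_fill(eye, float("-inf")) * temperature, dim=1)
        w_t2i = F.softmax(cost_t2i.masked_fill(eye, float("-inf")) * temperature, dim=0)

        # Harder negatives (larger violations) get larger weights, which
        # dynamically rebalances the contributions of negatives in training.
        return (w_i2t * cost_i2t).sum(1).mean() + (w_t2i * cost_t2i).sum(0).mean()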

