Journal of Computer Applications, 2024, Vol. 44, Issue (1): 16-23. DOI: 10.11772/j.issn.1001-9081.2023060766

• Cross-media representation learning and cognitive reasoning •

Image text retrieval method based on feature enhancement and semantic correlation matching

Jia CHEN1,2, Hong ZHANG1,2

  1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan Hubei 430081, China
  2. Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System (Wuhan University of Science and Technology), Wuhan Hubei 430081, China
  • Received: 2023-06-16 Revised: 2023-08-25 Accepted: 2023-08-31 Online: 2023-09-14 Published: 2024-01-10
  • Contact: Jia CHEN
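  • About authors: CHEN Jia, born in 1999, M. S. candidate. Her research interests include machine learning and cross-media retrieval.
    ZHANG Hong, born in 1979, Ph. D., professor. Her research interests include machine learning, cross-media retrieval, and data mining.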
  • Supported by:
    National Key Research and Development Program of China(2020AAA0108503)

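Image text retrieval method based on feature enhancement and semantic correlation matching

CHEN Jia1,2, ZHANG Hong1,2

  1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430081
  2. Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System (Wuhan University of Science and Technology), Wuhan 430081
  • Corresponding author: CHEN Jia
  • About the authors: CHEN Jia (born 1999), female, from Shangrao, Jiangxi, M. S. candidate; main research interests: machine learning, cross-media retrieval.
    ZHANG Hong (born 1979), female, from Xiangyang, Hubei, professor, Ph. D., CCF member; main research interests: machine learning, cross-media retrieval, data mining.
  • Funding:
    National Key Research and Development Program of China (2020AAA0108503)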

Abstract:

To establish precise semantic correlation between images and texts, an image text retrieval method based on Feature Enhancement and Semantic Correlation Matching (FESCM) was proposed. Firstly, in the feature enhancement representation module, a multi-head self-attention mechanism was introduced to enhance image region features and text word features, reducing the interference of redundant information with the alignment of image regions and text words. Secondly, the semantic correlation matching module was used both to capture the correspondence between locally salient objects through local matching and to incorporate image background information into the global image features, achieving accurate global semantic correlation through global matching. Finally, the local matching scores and global matching scores were used to obtain the final matching score between an image and a text. Experimental results show that the FESCM-based method improves the recall sum over the extended visual semantic embedding method by 5.7 and 7.5 percentage points on the Flickr8k and Flickr30k benchmark datasets, respectively, and over the Two-Stream Hierarchical Similarity Reasoning method by 3.7 percentage points on the MS-COCO dataset. The proposed method can effectively improve the accuracy of image text retrieval and realize the semantic connection between images and texts.
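The three stages described in the abstract (feature enhancement, local and global matching, score fusion) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the attention uses no learned projections and gives each head its own slice of the feature dimension, mean pooling stands in for the global features (the paper additionally fuses image background information), and the function names and the weight `alpha` are hypothetical, since the abstract does not state how the two scores are combined.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def self_attention_enhance(feats, num_heads):
    """Residual multi-head self-attention over a list of feature vectors.

    Simplification (assumption): no learned query/key/value projections;
    each head attends over its own slice of the feature dimension.
    """
    d = len(feats[0])
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    dh = d // num_heads
    out = [list(f) for f in feats]  # residual connection: start from the inputs
    for h in range(num_heads):
        lo, hi = h * dh, (h + 1) * dh
        heads = [f[lo:hi] for f in feats]
        for i, q in enumerate(heads):
            weights = softmax([dot(q, k) / math.sqrt(dh) for k in heads])
            for j, v in enumerate(heads):
                for c in range(dh):
                    out[i][lo + c] += weights[j] * v[c]
    return out

def mean_pool(feats):
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]

def local_score(regions, words):
    # For each word, keep its best-matching region, then average over words.
    return sum(max(cosine(w, r) for r in regions) for w in words) / len(words)

def global_score(regions, words):
    # Stand-in global features: mean-pooled regions and mean-pooled words.
    return cosine(mean_pool(regions), mean_pool(words))

def final_score(regions, words, alpha=0.5):
    # alpha is a hypothetical fusion weight, not a value from the paper.
    return alpha * local_score(regions, words) + (1 - alpha) * global_score(regions, words)

# Toy usage with made-up 4-dimensional features.
regions = [[1.0, 0.0, 0.5, 0.2], [0.1, 0.9, 0.3, 0.4], [0.7, 0.2, 0.1, 0.6]]
words = [[0.9, 0.1, 0.4, 0.3], [0.2, 0.8, 0.2, 0.5]]
score = final_score(self_attention_enhance(regions, 2), self_attention_enhance(words, 2))
print(f"matching score: {score:.3f}")
```

In a trained model the region and word features would come from learned encoders and the attention would use learned projections; the sketch only shows how a local score and a global score can feed one final image-text matching score.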

Key words: image text retrieval, feature enhancement representation, multi-head self-attention mechanism, semantic correlation matching

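Abstract (translated from Chinese):

To achieve a precise semantic connection between images and texts in image text retrieval, an image text retrieval method based on Feature Enhancement and Semantic Correlation Matching (FESCM) is proposed. First, in the feature enhancement representation module, a multi-head self-attention mechanism is introduced to enhance image region features and text word features, reducing the interference of redundant information with the alignment of image regions and text words. Second, in the semantic correlation matching module, local matching captures the correspondence between locally salient objects, while image background information is fused into the global image features and global matching is used to obtain accurate global semantic correlation. Finally, the final matching score of an image and a text is obtained from the local matching score and the global matching score. Experimental results show that, in recall sum, the FESCM-based image text retrieval method outperforms the extended visual semantic embedding method by 5.7 and 7.5 percentage points on the Flickr8k and Flickr30k benchmark datasets, respectively, and outperforms the two-stream hierarchical similarity reasoning method by 3.7 percentage points on the MS-COCO dataset. The method can therefore effectively improve the accuracy of image text retrieval and realize the semantic connection between images and texts.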
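The reported gains are differences in recall sum, which the abstract does not define. The sketch below assumes the convention common in image text retrieval: Recall@K is the percentage of queries whose first correct match is ranked within the top K, and the recall sum adds Recall@1, Recall@5 and Recall@10 over both retrieval directions (image-to-text and text-to-image), for a maximum of 600. The function names and the toy ranks are hypothetical.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose first correct item is ranked in the top k.

    `ranks` holds the 1-based rank of the first correct match per query.
    """
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

def recall_sum(i2t_ranks, t2i_ranks, ks=(1, 5, 10)):
    # Sum Recall@K over both retrieval directions (assumed convention).
    return sum(recall_at_k(i2t_ranks, k) + recall_at_k(t2i_ranks, k) for k in ks)

# Toy ranks for four image queries and four text queries (made-up values).
image_to_text = [1, 3, 12, 7]
text_to_image = [2, 1, 6, 20]
print(recall_sum(image_to_text, text_to_image))
```

Because each Recall@K is a percentage, the difference between the recall sums of two methods is expressed in percentage points, which matches the units used in the abstract.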

Key words: image text retrieval, feature enhancement representation, multi-head self-attention mechanism, semantic correlation matching
