Image text retrieval method based on feature enhancement and semantic correlation matching

doi:10.11772/j.issn.1001-9081.2023060766

Journal of Computer Applications, 2024, Vol. 44, Issue (1): 16-23. DOI: 10.11772/j.issn.1001-9081.2023060766

Cross-media representation learning and cognitive reasoning

Jia CHEN¹,², Hong ZHANG¹,²

1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan Hubei 430081, China
2. Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System (Wuhan University of Science and Technology), Wuhan Hubei 430081, China

Received: 2023-06-16 Revised: 2023-08-25 Accepted: 2023-08-31 Online: 2023-09-14 Published: 2024-01-10
Contact: Jia CHEN
About author: ZHANG Hong, born in 1979, Ph. D., professor. Her research interests include machine learning, cross-media retrieval, and data mining.
Supported by: National Key Research and Development Program of China (2020AAA0108503)

Author biographies: CHEN Jia, born in 1999, female, from Shangrao, Jiangxi, M. S. candidate. Her research interests include machine learning and cross-media retrieval. ZHANG Hong, born in 1979, female, from Xiangyang, Hubei, Ph. D., professor, CCF member. Her research interests include machine learning, cross-media retrieval, and data mining.

Abstract

To achieve precise semantic correlation between images and text, an image text retrieval method based on Feature Enhancement and Semantic Correlation Matching (FESCM) was proposed. Firstly, in the feature enhancement representation module, a multi-head self-attention mechanism was introduced to enhance image region features and text word features, reducing the interference of redundant information with the alignment of image regions and text words. Secondly, the semantic correlation matching module not only captured the correspondence between locally salient objects through local matching, but also incorporated image background information into the global image features and achieved accurate global semantic correlation through global matching. Finally, the local matching scores and global matching scores were combined to obtain the final matching scores of images and texts. Experimental results show that, compared with the extended visual semantic embedding method (VSE∞), the FESCM-based method improves the recall sum (Rsum) by 5.7 and 7.5 percentage points on the Flickr8k and Flickr30k benchmark datasets, respectively; on the MS-COCO dataset, it improves the recall sum by 3.7 percentage points over the Two-Stream Hierarchical Similarity Reasoning (TSHSR) method. The proposed method can effectively improve the accuracy of image text retrieval and realize the semantic connection between image and text.
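
The two-level matching described in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity, a local score that averages each word's best-matching region, mean pooling for the global features, and a plain sum to fuse the two scores. The abstract states none of these choices, so the function names (`local_score`, `global_score`, `final_score`) and the fusion rule are hypothetical and are not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise mean of a list of vectors (assumed global pooling)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def local_score(regions, words):
    """Assumed local matching: each word takes its best-matching region,
    and the per-word maxima are averaged."""
    return sum(max(cosine(w, r) for r in regions) for w in words) / len(words)

def global_score(regions, words):
    """Assumed global matching: cosine between mean-pooled features."""
    return cosine(mean_vector(regions), mean_vector(words))

def final_score(regions, words):
    """Assumed fusion: plain sum of the local and global scores."""
    return local_score(regions, words) + global_score(regions, words)

# Toy features: 2 image regions and 2 words in a shared 2-D space.
regions = [[1.0, 0.0], [0.0, 1.0]]
matching_words = [[1.0, 0.0], [0.0, 1.0]]
unrelated_words = [[-1.0, 0.0], [0.0, -1.0]]

print(final_score(regions, matching_words))   # higher score
print(final_score(regions, unrelated_words))  # lower score
```

In this toy case the matching caption scores higher on both levels, so any monotone fusion would rank it first; a weighted sum would be an equally plausible assumption.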

Key words: image text retrieval, feature enhancement representation, multi-head self-attention mechanism, semantic correlation matching

CLC Number: TP391.3

Jia CHEN, Hong ZHANG. Image text retrieval method based on feature enhancement and semantic correlation matching[J]. Journal of Computer Applications, 2024, 44(1): 16-23.

Figures/Tables 8

References 29

1	LIU Y, GUO Y Y, FANG J, et al. Survey of research on deep learning image-text cross-modal retrieval [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(3): 489-511. DOI: 10.3778/j.issn.1673-9418.2107076
2	LI X, WANG Y, SHA Z. Deep learning methods of cross-modal tasks for conceptual design of product shapes: a review [J]. Journal of Mechanical Design, 2023, 145(4): 041401. DOI: 10.1115/1.4056436
3	LIU C H, ZENG S, ZHANG B, et al. Cross-modal tensor fusion network based on semantic relation graph for image-text retrieval [J]. Journal of Computer Applications, 2022, 42(10): 3018-3024. DOI: 10.11772/j.issn.1001-9081.2021091622
4	LI Z X, LING F, ZHANG C L, et al. Cross-media image-text retrieval with two level similarity [J]. Acta Electronica Sinica, 2021, 49(2): 268-274. DOI: 10.12263/DZXB.20191037
5	FROME A, CORRADO G S, SHLENS J, et al. DeViSE: a deep visual-semantic embedding model [C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2013: 2121-2129.
6	FAGHRI F, FLEET D J, KIROS J R, et al. VSE++: improving visual-semantic embeddings with hard negatives [C]// Proceedings of the 2018 British Machine Vision Conference. Durham: BMVA Press, 2018: No.344.
7	GU J, CAI J, JOTY S R, et al. Look, imagine and match: improving textual-visual cross-modal retrieval with generative models [C]// Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7181-7189. DOI: 10.1109/cvpr.2018.00750
8	ZHEN L, HU P, WANG X, et al. Deep supervised cross-modal retrieval [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 10386-10395. DOI: 10.1109/cvpr.2019.01064
9	WEN K, GU X, CHENG Q. Learning dual semantic relations with graph attention for image-text matching [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(7): 2866-2879. DOI: 10.1109/tcsvt.2020.3030656
10	CHEN J, HU H, WU H, et al. Learning the best pooling strategy for visual semantic embedding [C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 15784-15793. DOI: 10.1109/cvpr46437.2021.01553
11	KARPATHY A, JOULIN A, LI F-F. Deep fragment embeddings for bidirectional image sentence mapping [C]// Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2014: 1889-1897.
12	NIU Z, ZHOU M, WANG L, et al. Hierarchical multimodal LSTM for dense visual-semantic embedding [C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 1899-1907. DOI: 10.1109/iccv.2017.208
13	NAM H, HA J-W, KIM J. Dual attention networks for multimodal reasoning and matching [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 2156-2164. DOI: 10.1109/cvpr.2017.232
14	LEE K-H, CHEN X, HUA G, et al. Stacked cross attention for image-text matching [C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 212-228. DOI: 10.1007/978-3-030-01225-0_13
15	CHEN H, DING G, LIU X, et al. IMRAM: iterative matching with recurrent attention memory for cross-modal image-text retrieval [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 12652-12660. DOI: 10.1109/cvpr42600.2020.01267
16	QU L, LIU M, WU J, et al. Dynamic modality interaction modeling for image-text retrieval [C]// Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2021: 1104-1113. DOI: 10.1145/3404835.3462829
17	JI Z, CHEN K, WANG H. Step-wise hierarchical alignment network for image-text matching [EB/OL]. [2021-01-11]. DOI: 10.24963/ijcai.2021/106
18	CHEN R, WANG H, WANG L, et al. Two-stream hierarchical similarity reasoning for image-text matching [EB/OL]. [2022-03-10].
19	ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086. DOI: 10.1109/cvpr.2018.00636
20	KRISHNA R, ZHU Y, GROTH O, et al. Visual Genome: connecting language and vision using crowdsourced dense image annotations [J]. International Journal of Computer Vision, 2017, 123(1): 32-73. DOI: 10.1007/s11263-016-0981-7
21	REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. DOI: 10.1109/tpami.2016.2577031
22	HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. DOI: 10.1109/cvpr.2016.90
23	DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database [C]// Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2009: 248-255. DOI: 10.1109/cvpr.2009.5206848
24	SCHUSTER M, PALIWAL K K. Bidirectional recurrent neural networks [J]. IEEE Transactions on Signal Processing, 1997, 45(11): 2673-2681. DOI: 10.1109/78.650093
25	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
26	PLUMMER B A, WANG L, CERVANTES C M, et al. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models [C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 2641-2649. DOI: 10.1109/iccv.2015.303
27	VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 652-663. DOI: 10.1109/tpami.2016.2587640
28	JIANG Z, LIAN Z. Mutil-level local alignment and semantic matching network for image-text retrieval [C]// Proceedings of the 2022 International Conference on Artificial Neural Networks. Cham: Springer, 2022: 212-224. DOI: 10.1007/978-3-031-15934-3_18
29	KINGMA D P, BA J. Adam: a method for stochastic optimization [EB/OL]. (2017-01-30) [2021-08-03].

Dataset	Total samples	Training set	Validation set	Test set
Flickr8k	8 000	6 000	1 000	1 000
Flickr30k	31 000	29 000	1 000	1 000
MS-COCO	123 287	113 287	5 000	5 000

Method	Image-to-text R@1	Image-to-text R@5	Image-to-text R@10	Text-to-image R@1	Text-to-image R@5	Text-to-image R@10	Rsum
VSE++ [6]	47.9	77.3	87.1	35.2	65.5	77.6	390.6
VSE∞ [10]	55.0	84.8	91.1	41.7	69.9	80.0	422.5
SCAN [14]	52.2	81.0	89.2	38.3	67.8	78.9	407.4
IMRAM [15]	54.7	84.2	91.0	41.0	69.2	79.9	420.0
FESCM	56.8	85.9	91.6	43.1	70.5	80.3	428.2
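
Rsum in these tables is the sum of the six recall values in a row. A minimal check in Python, using the FESCM and VSE∞ rows exactly as printed above (the helper name `rsum` is hypothetical):

```python
def rsum(recalls):
    """Sum of the six recall values (R@1, R@5, R@10 in both directions),
    rounded to one decimal place as in the table."""
    return round(sum(recalls), 1)

fescm = [56.8, 85.9, 91.6, 43.1, 70.5, 80.3]
vse_inf = [55.0, 84.8, 91.1, 41.7, 69.9, 80.0]

print(rsum(fescm))                            # 428.2
print(rsum(vse_inf))                          # 422.5
print(round(rsum(fescm) - rsum(vse_inf), 1))  # 5.7
```

The 5.7-point gap equals the Flickr8k improvement over the extended visual semantic embedding method reported in the abstract.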

Method	Image-to-text R@1	Image-to-text R@5	Image-to-text R@10	Text-to-image R@1	Text-to-image R@5	Text-to-image R@10	Rsum
VSE++ [6]	52.9	80.5	87.2	39.6	70.1	79.5	409.8
VSE∞ [10]	76.5	94.2	97.7	56.4	83.4	89.9	498.1
SCAN [14]	67.4	90.3	95.8	48.6	77.7	85.2	465.0
IMRAM [15]	74.1	93.0	96.6	53.9	79.4	87.2	484.2
SHAN [17]	74.6	92.5	96.9	55.3	81.3	88.4	490.0
TSHSR [18]	76.3	93.0	95.8	56.6	81.2	85.9	488.8
FESCM	78.5	94.9	98.0	58.3	85.7	90.2	505.6

Model	IFER	TFER	LM	GM-1	GM-2	GM-full	Image-to-text R@1	Image-to-text R@5	Image-to-text R@10	Text-to-image R@1	Text-to-image R@5	Text-to-image R@10	Rsum
Model 1	×	×	√	×	×	×	67.4	90.3	95.8	48.6	77.7	85.2	465.0
Model 2	√	×	√	×	×	×	68.3	91.1	96.2	50.4	78.9	86.1	471.0
Model 3	×	√	√	×	×	×	68.8	91.4	96.1	50.6	78.8	85.9	471.6
Model 4	√	√	√	×	×	×	72.7	92.5	96.9	53.9	81.2	87.3	484.5
Model 5	√	√	√	√	×	×	77.9	94.3	97.8	54.6	82.8	88.1	495.5
Model 6	√	√	√	×	√	×	73.5	92.8	97.1	57.4	85.3	89.6	495.7
Model 7	√	√	√	×	×	√	78.5	94.9	98.0	58.3	85.7	90.2	505.6

Method	Image-to-text R@1	Image-to-text R@5	Image-to-text R@10	Text-to-image R@1	Text-to-image R@5	Text-to-image R@10	Rsum
VSE++ [6]	64.6	90.0	95.7	52.0	84.3	92.0	478.6
VSE∞ [10]	78.5	96.0	98.7	61.7	90.3	95.6	520.8
SCAN [14]	72.7	94.8	98.4	58.8	88.4	94.8	507.9
IMRAM [15]	76.7	95.6	98.5	61.7	89.1	95.0	516.6
SHAN [17]	76.8	96.3	98.7	62.6	89.6	95.8	519.8
TSHSR [18]	79.0	96.2	98.6	63.1	89.9	95.4	522.2
FESCM	79.6	96.8	99.1	63.3	90.7	96.4	525.9
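
The R@K columns report Recall@K. Under the usual definition, which is assumed here because this page does not spell it out, a query counts as a hit when at least one ground-truth item appears among its top K ranked results, and R@K is the percentage of hits over all queries. The function `recall_at_k` and the toy rankings below are hypothetical and only illustrate that definition.

```python
def recall_at_k(rankings, ground_truth, k):
    """Percentage of queries with at least one relevant item in the top k.

    rankings: one ranked list of candidate ids per query (best first).
    ground_truth: one set of relevant ids per query.
    """
    hits = sum(
        1 for ranked, relevant in zip(rankings, ground_truth)
        if relevant & set(ranked[:k])
    )
    return 100.0 * hits / len(rankings)

# Four toy queries, each with a ranked list of three candidate texts.
rankings = [
    ["t3", "t1", "t7"],
    ["t2", "t9", "t4"],
    ["t8", "t5", "t6"],
    ["t1", "t2", "t3"],
]
ground_truth = [{"t1"}, {"t2"}, {"t6"}, {"t9"}]

for k in (1, 2, 3):
    print(k, recall_at_k(rankings, ground_truth, k))
```

Recall can only stay level or rise as K grows, which is why every row in the tables increases from R@1 to R@10.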
