Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (12): 3776-3783. DOI: 10.11772/j.issn.1001-9081.2023121860

• Artificial intelligence •

Image-text retrieval model based on intra-modal fine-grained feature relationship extraction

Zucheng WU, Xiaojun WU, Tianyang XU

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Received: 2024-01-10 Revised: 2024-04-25 Accepted: 2024-05-07 Online: 2024-06-07 Published: 2024-12-10
  • Contact: Xiaojun WU
  • About author: WU Zucheng, born in 1998 in Suzhou, Jiangsu, M.S. candidate. His research interests include cross-modal retrieval and deep learning.
    XU Tianyang, born in 1989 in Wuxi, Jiangsu, Ph.D., associate professor. His research interests include artificial intelligence, pattern recognition, and computer vision.
  • Supported by:
    National Natural Science Foundation of China(62020106012)


Abstract:

Cross-modal retrieval involves diverse relationships between objects, and the traditional appearance-based paradigm performs poorly in complex scenes because it cannot accurately reflect the relationships between salient objects in an image. To address these problems, an image-text retrieval model based on intra-modal fine-grained feature relationship extraction was proposed. Firstly, to obtain more intuitive position information, the image was divided into grids, and position representations were built from the spatial relationships between objects and grid cells. Secondly, to keep node information stable and independent during the relationship modeling stage, a cross-modal information-guided feature fusion module was employed. Finally, an adaptive triplet loss was proposed to dynamically balance the training weights of positive and negative samples. Experimental results show that, compared with CHAN (Cross-modal Hard Aligning Network), the proposed model improves the R@sum metric (the sum of R@1, R@5, and R@10 for both image-to-text and text-to-image retrieval) by 1.5% and 0.02% on the Flickr30K and MS-COCO 1K datasets, respectively, which verifies the effectiveness of the proposed model in retrieval recall.
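As a concrete illustration of the first step, the sketch below represents each object's position by how much of every grid cell its bounding box covers. This overlap-ratio encoding, the grid size of 8, and the function name are assumptions made for illustration, not the paper's exact construction.

    import torch

    def grid_position_encoding(boxes, img_w, img_h, grid=8):
        # boxes: (n, 4) object bounding boxes as (x1, y1, x2, y2) in pixels.
        # Returns (n, grid*grid): for each object, the fraction of every grid
        # cell covered by its box. NOTE: an illustrative assumption, not
        # necessarily the paper's exact position representation.
        cell_w, cell_h = img_w / grid, img_h / grid
        enc = torch.zeros(boxes.size(0), grid * grid)
        for i, (x1, y1, x2, y2) in enumerate(boxes.tolist()):
            for r in range(grid):
                for c in range(grid):
                    # Intersection area between the box and grid cell (r, c).
                    ix = max(0.0, min(x2, (c + 1) * cell_w) - max(x1, c * cell_w))
                    iy = max(0.0, min(y2, (r + 1) * cell_h) - max(y1, r * cell_h))
                    enc[i, r * grid + c] = (ix * iy) / (cell_w * cell_h)
        return enc

Such a vector gives each detected object an explicit, grid-anchored location that a Graph Convolutional Network can consume alongside appearance features.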

Key words: cross-modal retrieval, image-text retrieval, relationship extraction, Graph Convolutional Network (GCN), triplet loss
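The adaptive triplet loss is not specified in detail in the abstract; the sketch below shows one plausible reading under stated assumptions, in which harder negatives receive larger softmax weights instead of the usual hardest-negative selection. The margin, temperature, and weighting scheme are all illustrative choices, not the paper's formulation.

    import torch
    import torch.nn.functional as F

    def adaptive_triplet_loss(img_emb, txt_emb, margin=0.2, temperature=10.0):
        # img_emb, txt_emb: (batch, dim) L2-normalized embeddings; row i of
        # each is a matched image-text pair. The softmax weighting below is
        # an illustrative assumption, not the paper's exact formulation.
        scores = img_emb @ txt_emb.t()          # (batch, batch) similarities
        pos = scores.diag().unsqueeze(1)        # similarity of matched pairs

        # Margin violations in both retrieval directions.
        cost_i2t = (margin + scores - pos).clamp(min=0)      # image -> text
        cost_t2i = (margin + scores - pos.t()).clamp(min=0)  # text -> image

        # Exclude the positive pairs on the diagonal from the weighting.
        eye = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
        w_i2t = F.softmax(cost_i2t.masked_fill(eye, float("-inf")) * temperature, dim=1)
        w_t2i = F.softmax(cost_t2i.masked_fill(eye, float("-inf")) * temperature, dim=0)

        # Harder negatives (larger violations) get larger weights, which
        # dynamically rebalances the contributions of negatives in training.
        return (w_i2t * cost_i2t).sum(1).mean() + (w_t2i * cost_t2i).sum(0).mean()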

