Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (10): 3018-3024. DOI: 10.11772/j.issn.1001-9081.2021091622

• Artificial intelligence •

Cross-modal tensor fusion network based on semantic relation graph for image-text retrieval

Changhong LIU1, Sheng ZENG1, Bin ZHANG1, Yong CHEN2   

  1. School of Computer and Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 330022, China
    2. School of Business Administration, Nanchang Institute of Technology, Nanchang, Jiangxi 330029, China
  • Received: 2021-09-14 Revised: 2021-12-20 Accepted: 2021-12-30 Online: 2022-10-14 Published: 2022-10-10
  • Contact: Changhong LIU, liuch@jxnu.edu.cn
  • About author: LIU Changhong, born in 1977, Ph.D., associate professor, CCF member. Her research interests include computer vision, cross-modal information retrieval, and hyperspectral image processing.
    ZENG Sheng, born in 1996, M.S. candidate. His research interests include cross-modal information retrieval and computer vision.
    ZHANG Bin, born in 1997, M.S. candidate. His research interests include cross-modal generation and computer vision.
    CHEN Yong, born in 1973, M.S., lecturer. His research interests include e-commerce and image processing.
  • Supported by:
    National Natural Science Foundation of China(62067004)


Abstract:

Effectively capturing the semantic correlation between images and text is the key to cross-modal image-text retrieval. Most existing methods learn either the global semantic correlation between image region features and text features, or the local semantic correlation between objects across modalities, while ignoring the correlation between intra-modality object relationships and inter-modality object relationships. To address this problem, a Cross-Modal Tensor Fusion Network based on Semantic Relation Graph (CMTFN-SRG) was proposed for image-text retrieval. Firstly, the relationships among image regions were learned by a Graph Convolutional Network (GCN), and the relationships among text words were constructed by a Bidirectional Gated Recurrent Unit (Bi-GRU). Then, the learned semantic relation graphs of image regions and text words were matched through a tensor fusion network to learn the fine-grained semantic correlation between the data of the two modalities. At the same time, a Gated Recurrent Unit (GRU) was used to learn the global features of the image, and the global features of the image and the text were matched to capture the inter-modality global semantic correlation. The proposed method was compared with the Multi-Modality Cross Attention (MMCA) method on the benchmark datasets Flickr30K and MS-COCO. Experimental results show that the proposed method improves the Recall@1 of the text-to-image retrieval task by 2.6%, 9.0% and 4.1% on the Flickr30K, MS-COCO 1K and MS-COCO 5K test sets respectively, and improves the mean Recall (mR) by 0.4, 1.3 and 0.1 percentage points respectively, indicating that the proposed method can effectively improve the precision of image-text retrieval.
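To make the two matching branches described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch: a GCN-style update over detected region features paired with a Bi-GRU over word embeddings for relation-level (fine-grained) matching via low-rank bilinear (tensor-fusion style) interaction, plus a GRU-based global branch. All module names, dimensions, adjacency construction, and the score aggregation are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CMTFNSRGSketch(nn.Module):
    """Hypothetical sketch of the CMTFN-SRG matching branches (not the authors' code).
    Assumed inputs: detector region features (e.g., R x 2048 per image) and
    word embeddings per sentence; all dimensions are illustrative."""
    def __init__(self, region_dim=2048, word_dim=300, hidden=1024, rank=256):
        super().__init__()
        # Relation branch: one GCN layer over image regions (adjacency from
        # pairwise region affinities) and a Bi-GRU over words.
        self.region_proj = nn.Linear(region_dim, hidden)
        self.gcn_weight = nn.Linear(hidden, hidden)
        self.word_bigru = nn.GRU(word_dim, hidden // 2, bidirectional=True,
                                 batch_first=True)
        # Low-rank factors for bilinear (tensor-fusion style) matching.
        self.img_factor = nn.Linear(hidden, rank, bias=False)
        self.txt_factor = nn.Linear(hidden, rank, bias=False)
        # Global branch: GRU summary over the region sequence.
        self.img_gru = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, regions, words):
        # regions: (B, R, region_dim); words: (B, T, word_dim)
        v = self.region_proj(regions)                       # (B, R, hidden)
        adj = F.softmax(torch.bmm(v, v.transpose(1, 2)), dim=-1)
        v_rel = F.relu(self.gcn_weight(torch.bmm(adj, v)))  # GCN-style update
        t_rel, _ = self.word_bigru(words)                   # (B, T, hidden)

        # Fine-grained matching: bilinear similarity between every region
        # node and every word node, then aggregated to one score per pair.
        sim = torch.einsum('brk,btk->brt',
                           self.img_factor(v_rel), self.txt_factor(t_rel))
        local_score = sim.max(dim=1).values.mean(dim=1)     # (B,)

        # Global matching: GRU image summary vs. mean-pooled text features.
        _, g_img = self.img_gru(v)                          # (1, B, hidden)
        g_txt = t_rel.mean(dim=1)
        global_score = F.cosine_similarity(g_img.squeeze(0), g_txt, dim=-1)
        return local_score + global_score                   # fused similarity

In a full training setup, a similarity of this kind would typically be optimized with a bidirectional triplet ranking loss over matched and mismatched image-text pairs, as is common in this line of work.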

Key words: cross-modal retrieval, tensor fusion network, Graph Convolutional Network (GCN), semantic correlation, semantic relation graph
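As a note on the reported figures: Recall@K is the fraction of queries whose ground-truth match appears among the top K retrieved results, and mean Recall (mR) averages Recall@{1, 5, 10} over both retrieval directions (image-to-text and text-to-image). The helper below illustrates this protocol in a simplified one-ground-truth-per-query setting; variable names are assumptions, and on benchmarks such as Flickr30K and MS-COCO each image actually has five captions, any of which counts as correct.

import numpy as np

def recall_at_k(sim, gt_index, k):
    """Fraction of queries whose ground-truth item ranks in the top k.
    sim: (num_queries, num_gallery) similarity matrix; gt_index[i] is the
    gallery index of query i's match (simplified 1:1 setting)."""
    ranks = np.argsort(-sim, axis=1)                 # best match first
    topk = ranks[:, :k]
    return float(np.mean([gt_index[i] in topk[i] for i in range(len(sim))]))

def mean_recall(sim_i2t, sim_t2i, gt_i2t, gt_t2i):
    """mR: average of Recall@{1,5,10} over both retrieval directions."""
    scores = [recall_at_k(s, g, k)
              for s, g in ((sim_i2t, gt_i2t), (sim_t2i, gt_t2i))
              for k in (1, 5, 10)]
    return sum(scores) / len(scores)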

