《计算机应用》(Journal of Computer Applications) ›› 2022, Vol. 42 ›› Issue (10): 3018-3024. DOI: 10.11772/j.issn.1001-9081.2021091622
所属专题: 人工智能
刘长红1, 曾胜1, 张斌1, 陈勇2
收稿日期:
2021-09-14
修回日期:
2021-12-20
接受日期:
2021-12-30
发布日期:
2022-10-14
出版日期:
2022-10-10
通讯作者:
刘长红
作者简介:
第一联系人:刘长红(1977—),女,江西南丰人,副教授,博士,CCF会员,主要研究方向:计算机视觉、跨模态信息检索、高光谱图像处理。E-mail: liuch@jxnu.edu.cn
Changhong LIU1, Sheng ZENG1, Bin ZHANG1, Yong CHEN2
Received:
2021-09-14
Revised:
2021-12-20
Accepted:
2021-12-30
Online:
2022-10-14
Published:
2022-10-10
Contact:
Changhong LIU
About author:
LIU Changhong, born in 1977, Ph.D., associate professor, CCF member. Her research interests include computer vision, cross-modal information retrieval, and hyperspectral image processing. E-mail: liuch@jxnu.edu.cn
Abstract:
The key difficulty of cross-modal image-text retrieval lies in effectively learning the semantic correlation between images and texts. Most existing methods learn either the global semantic correlation between image region features and text features or the local semantic correlation between objects across modalities, while ignoring the relations among objects within each modality and the association between the object relations of the two modalities. To address these problems, an image-text retrieval method based on a Cross-Modal Tensor Fusion Network based on Semantic Relation Graph (CMTFN-SRG) was proposed. Firstly, a Graph Convolutional Network (GCN) was used to learn the relations among image regions, and a Bidirectional Gated Recurrent Unit (Bi-GRU) was used to model the relations among text words. Then, the learned semantic relation graphs of image regions and text words were matched through a tensor fusion network to learn the fine-grained semantic correlation between the two modalities. At the same time, a Gated Recurrent Unit (GRU) was used to learn the global feature of the image, and the global features of the image and the text were matched to capture the global semantic correlation between modalities. The proposed method was compared with the Multi-Modality Cross Attention (MMCA) method on the Flickr30K and MS-COCO benchmark datasets. Experimental results show that the proposed method improves Recall@1 of the text-to-image retrieval task by 2.6%, 9.0% and 4.1% on the Flickr30K test set, the MS-COCO 1K test set and the MS-COCO 5K test set respectively, and improves the mean recall (mR) by 0.4, 1.3 and 0.1 percentage points respectively, indicating that the proposed method can effectively improve the accuracy of image-text retrieval.
刘长红, 曾胜, 张斌, 陈勇. 基于语义关系图的跨模态张量融合网络的图像文本检索[J]. 计算机应用, 2022, 42(10): 3018-3024.
Changhong LIU, Sheng ZENG, Bin ZHANG, Yong CHEN. Cross-modal tensor fusion network based on semantic relation graph for image-text retrieval[J]. Journal of Computer Applications, 2022, 42(10): 3018-3024.
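The abstract above names three learnable components: a GCN that models relations between image regions, a Bi-GRU that models relations between text words, and a GRU that produces a global image feature. The minimal PyTorch sketch below only illustrates how such encoders fit together; it is not the authors' implementation, and the feature dimensions (2048-d regions, 300-d word embeddings, 1024-d hidden states), the single GCN layer and the uniform fully connected region graph are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of the encoders described in the abstract:
# a GCN over image-region features, a Bi-GRU over word embeddings, and a GRU that
# aggregates regions into a global image feature. Dimensions and the uniform,
# fully connected region graph are assumptions.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        # adj: (B, N, N) row-normalized adjacency over the N image regions
        return torch.relu(torch.bmm(adj, self.linear(h)))

class RelationEncoders(nn.Module):
    def __init__(self, region_dim=2048, word_dim=300, hidden_dim=1024):
        super().__init__()
        self.region_gcn = SimpleGCNLayer(region_dim, hidden_dim)
        # Bi-GRU models relations between text words
        self.text_bigru = nn.GRU(word_dim, hidden_dim // 2, batch_first=True,
                                 bidirectional=True)
        # GRU over region features yields a global image representation
        self.global_gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, regions, words):
        B, N, _ = regions.shape
        # Assumption: uniform fully connected region graph; the paper instead
        # learns semantic relations between regions.
        adj = torch.full((B, N, N), 1.0 / N, device=regions.device)
        region_nodes = self.region_gcn(regions, adj)   # (B, N, hidden)
        word_nodes, _ = self.text_bigru(words)         # (B, T, hidden)
        _, g = self.global_gru(region_nodes)           # (1, B, hidden)
        return region_nodes, word_nodes, g.squeeze(0)

# Toy usage with random features (36 detected regions, 12 words)
enc = RelationEncoders()
r, w, g = enc(torch.randn(2, 36, 2048), torch.randn(2, 12, 300))
print(r.shape, w.shape, g.shape)  # (2, 36, 1024) (2, 12, 1024) (2, 1024)
```

In the paper, the region features would presumably come from an object detector such as Faster R-CNN [23] with bottom-up attention [24], and the region graph would encode learned semantic relations rather than the uniform adjacency used here.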
表1 两个常用的基准数据集
Tab. 1 Two commonly used benchmark datasets

| Dataset | Training samples | Validation samples | Test samples | Total samples |
| --- | --- | --- | --- | --- |
| MS-COCO | 113 287 | 5 000 | 5 000 | 123 287 |
| Flickr30K | 28 000 | 1 000 | 1 000 | 31 783 |
表2 Flickr30K测试集上的召回率对比结果 (%)
Tab. 2 Recall comparison results on Flickr30K test set

| Method | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | mR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RRF[22] | 47.6 | 77.4 | 87.1 | 35.4 | 68.3 | 79.9 | 65.9 |
| VSE++[4] | 52.9 | 79.1 | 87.2 | 39.6 | 69.6 | 79.5 | 67.9 |
| DPC[27] | 55.6 | 81.9 | 89.5 | 39.1 | 69.2 | 80.9 | 69.3 |
| SCO[28] | 55.5 | 82.0 | 89.3 | 41.1 | 70.5 | 88.1 | 71.0 |
| SCAN[15] | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 | 77.5 |
| MTFN[29] | 65.3 | 88.3 | 93.3 | 52.0 | 80.1 | 86.1 | 77.5 |
| VSRN[16] | 71.3 | 90.6 | 96.0 | 54.7 | 81.8 | 88.2 | 80.4 |
| MMCA[18] | 74.2 | 92.8 | 96.4 | 54.8 | 81.4 | 87.8 | 81.2 |
| CMTFN-SRG | 73.6 | 91.7 | 96.3 | 56.2 | 82.5 | 89.4 | 81.6 |
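Tables 2-4 report Recall@K (R@1, R@5, R@10) for both retrieval directions together with the mean recall mR, i.e. the average of the six recall values. The sketch below shows one common way to compute these metrics from a precomputed image-text similarity matrix; the 5-captions-per-image indexing is an assumption about the standard Flickr30K/MS-COCO layout rather than a detail taken from this page, and the fold averaging used in the MS-COCO 1K protocol is omitted.

```python
# Minimal sketch of how Recall@K and mR are typically computed from an
# image-text similarity matrix; not the authors' evaluation code.
# Assumes 5 captions per image, with caption j belonging to image j // 5.
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """sim: (n_images, n_captions) similarity matrix, n_captions = 5 * n_images."""
    n_img, n_cap = sim.shape
    # Image-to-text: rank of the best-ranked ground-truth caption per image
    i2t_ranks = []
    for i in range(n_img):
        order = np.argsort(-sim[i])                 # captions, best first
        gt = np.arange(5 * i, 5 * i + 5)
        i2t_ranks.append(int(np.where(np.isin(order, gt))[0].min()))
    # Text-to-image: rank of the ground-truth image per caption
    t2i_ranks = []
    for j in range(n_cap):
        order = np.argsort(-sim[:, j])              # images, best first
        t2i_ranks.append(int(np.where(order == j // 5)[0][0]))
    i2t = [100.0 * np.mean(np.array(i2t_ranks) < k) for k in ks]
    t2i = [100.0 * np.mean(np.array(t2i_ranks) < k) for k in ks]
    mr = float(np.mean(i2t + t2i))                  # mean recall (mR)
    return i2t, t2i, mr

# Toy example: 100 images, 500 captions, random scores
print(recall_at_k(np.random.rand(100, 500)))
```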
表3 在MS-COCO5K测试集上的召回率对比结果 (%)
Tab. 3 Recall comparison results on MS-COCO5K test set

| Method | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | mR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VSE++[4] | 41.3 | 69.2 | 81.2 | 30.3 | 59.1 | 72.4 | 58.9 |
| DPC[27] | 41.2 | 70.5 | 81.1 | 25.3 | 53.4 | 66.4 | 56.3 |
| SCO[28] | 42.8 | 72.3 | 83.0 | 33.1 | 62.9 | 75.5 | 61.6 |
| SCAN[15] | 50.4 | 82.2 | 90.0 | 38.6 | 69.3 | 80.4 | 68.4 |
| MTFN[29] | 48.3 | 77.6 | 87.3 | 35.9 | 66.1 | 76.1 | 65.2 |
| VSRN[16] | 53.0 | 81.1 | 89.4 | 40.5 | 70.6 | 81.1 | 69.2 |
| MMCA[18] | 54.0 | 82.5 | 90.7 | 38.7 | 69.7 | 80.8 | 69.4 |
| CMTFN-SRG | 53.0 | 81.5 | 89.7 | 40.3 | 71.0 | 81.7 | 69.5 |
表4 MS-COCO1K测试集上的召回率对比结果 (%)
Tab. 4 Recall comparison results on MS-COCO1K test set

| Method | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | mR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RRF[22] | 56.4 | 85.3 | 91.5 | 43.9 | 78.1 | 88.6 | 73.9 |
| VSE++[4] | 64.6 | 89.1 | 95.7 | 52.0 | 83.1 | 92.0 | 79.4 |
| DPC[27] | 65.6 | 89.8 | 95.5 | 47.1 | 79.9 | 90.0 | 77.9 |
| SCO[28] | 69.9 | 92.9 | 97.5 | 56.7 | 87.5 | 94.8 | 83.2 |
| SCAN[15] | 72.7 | 94.8 | 98.2 | 58.8 | 88.4 | 94.8 | 84.6 |
| MTFN[29] | 74.3 | 94.9 | 97.9 | 60.1 | 89.1 | 95.0 | 85.2 |
| VSRN[16] | 76.2 | 94.8 | 98.2 | 62.8 | 89.7 | 95.1 | 86.1 |
| MMCA[18] | 74.8 | 95.6 | 97.7 | 57.8 | 88.6 | 94.9 | 84.9 |
| CMTFN-SRG | 75.6 | 95.2 | 98.3 | 63.0 | 90.0 | 95.4 | 86.2 |
表5 在MS-COCO1K测试集上进行消融实验的结果 (%)
Tab. 5 Ablation experimental results on MS-COCO1K test set

| Method | Image-to-text R@1 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@10 |
| --- | --- | --- | --- | --- |
| Avg-pool (baseline) | 64.3 | 90.5 | 49.2 | 83.4 |
| IR | 74.0 | 94.2 | 61.3 | 89.8 |
| 10TF+IR | 74.4 | 94.6 | 61.4 | 90.0 |
| 20TF+IR | 75.6 | 95.2 | 63.0 | 90.0 |
| 30TF+IR | 74.2 | 94.4 | 61.8 | 89.9 |
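In Table 5, IR presumably denotes the image-relation branch and the 10TF/20TF/30TF prefix the number of tensor-fusion blocks stacked on top of it, with 20 blocks giving the best results in both directions. As a rough illustration only — this excerpt does not spell out the exact fusion operator used in CMTFN-SRG — the sketch below shows a generic low-rank bilinear ("tensor") fusion head whose rank loosely plays the role of that 10/20/30 setting; the class name, dimensions and tanh activations are assumptions.

```python
# Illustrative low-rank tensor-fusion scoring head; the paper's exact CMTFN-SRG
# fusion operator is not reproduced here. `n_blocks` loosely corresponds to the
# 10/20/30 TF setting varied in Table 5 (an assumption, not a fact from the page).
import torch
import torch.nn as nn

class LowRankTensorFusion(nn.Module):
    def __init__(self, img_dim=1024, txt_dim=1024, n_blocks=20):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, n_blocks)
        self.txt_proj = nn.Linear(txt_dim, n_blocks)
        self.score = nn.Linear(n_blocks, 1)

    def forward(self, img_feat, txt_feat):
        # Element-wise product of the projected features approximates a bilinear
        # (tensor) interaction whose rank equals n_blocks.
        fused = torch.tanh(self.img_proj(img_feat)) * torch.tanh(self.txt_proj(txt_feat))
        return self.score(fused).squeeze(-1)   # one match score per image-text pair

fusion = LowRankTensorFusion(n_blocks=20)
s = fusion(torch.randn(8, 1024), torch.randn(8, 1024))
print(s.shape)  # torch.Size([8])
```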
[1] HUANG Y, WANG W, WANG L. Instance-aware image and sentence matching with selective multimodal LSTM[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 7254-7262. 10.1109/cvpr.2017.767
[2] KIROS R, SALAKHUTDINOV R, ZEMEL R S. Unifying visual-semantic embeddings with multimodal neural language models[EB/OL]. (2014-11-10) [2021-08-03].
[3] 邓一姣, 张凤荔, 陈学勤, 等. 面向跨模态检索的协同注意力网络模型[J]. 计算机科学, 2020, 47(4): 54-59. 10.11896/jsjkx.190600181
    DENG Y J, ZHANG F L, CHEN X Q, et al. Collaborative attention network modal for cross-modal retrieval[J]. Computer Science, 2020, 47(4): 54-59.
[4] FAGHRI F, FLEET D J, KIROS J R, et al. VSE++: improving visual-semantic embeddings with hard negatives[C]// Proceedings of the 2018 British Machine Vision Conference. Durham: BMVA Press, 2018: No.344.
[5] FROME A, CORRADO G S, SHLENS J, et al. DeViSE: a deep visual-semantic embedding model[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2013: 2121-2129.
[6] GU J X, CAI J F, JOTY S, et al. Look, imagine and match: improving textual-visual cross-modal retrieval with generative models[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7181-7189. 10.1109/cvpr.2018.00750
[7] KLEIN B, LEV G, SADEH G, et al. Associating neural word embeddings with deep image representations using Fisher vectors[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 4437-4446. 10.1109/cvpr.2015.7299073
[8] PENG Y X, QI J E. CM-GANs: cross-modal generative adversarial networks for common representation learning[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2019, 15(1): No.22. 10.1145/3284750
[9] YAN F, MIKOLAJCZYK K. Deep correlation for matching images and text[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3441-3450. 10.1109/cvpr.2015.7298966
[10] 黄育, 张鸿. 基于潜语义主题加强的跨媒体检索算法[J]. 计算机应用, 2017, 37(4): 1061-1064, 1110. 10.11772/j.issn.1001-9081.2017.04.1061
    HUANG Y, ZHANG H. Cross-media retrieval based on latent semantic topic reinforce[J]. Journal of Computer Applications, 2017, 37(4): 1061-1064, 1110.
[11] 严双咏, 刘长红, 江爱文, 等. 语义耦合相关的判别式跨模态哈希学习算法[J]. 计算机学报, 2019, 42(1): 164-175. 10.11897/SP.J.1016.2019.00164
    YAN S Y, LIU C H, JIANG A W, et al. Discriminative cross-modal hashing with coupled semantic correlation[J]. Chinese Journal of Computers, 2019, 42(1): 164-175.
[12] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3128-3137. 10.1109/cvpr.2015.7298932
[13] NIU Z X, ZHOU M, WANG L, et al. Hierarchical multimodal LSTM for dense visual-semantic embedding[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 1899-1907. 10.1109/iccv.2017.208
[14] NAM H, HA J W, KIM J. Dual attention networks for multimodal reasoning and matching[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 2156-2164. 10.1109/cvpr.2017.232
[15] LEE K H, CHEN X, HUA G, et al. Stacked cross attention for image-text matching[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11208. Cham: Springer, 2018: 212-228.
[16] LI K P, ZHANG Y L, LI K, et al. Visual semantic reasoning for image-text matching[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 4653-4661. 10.1109/iccv.2019.00475
[17] KIPF T N, WELLING M. Semi-supervised classification with graph convolutional networks[EB/OL]. (2017-02-22) [2021-06-20]. 10.48550/arXiv.1609.02907
[18] WEI X, ZHANG T Z, LI Y, et al. Multi-modality cross attention network for image and sentence matching[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10938-10947. 10.1109/cvpr42600.2020.01095
[19] CHUNG J, GULCEHRE C, CHO K, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling[EB/OL]. (2014-12-11) [2021-06-20]. 10.1007/978-3-030-89929-5_3
[20] CHEN H, DING G G, LIU X D, et al. IMRAM: iterative matching with recurrent attention memory for cross-modal image-text retrieval[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 12652-12660. 10.1109/cvpr42600.2020.01267
[21] WANG L W, LI Y, LAZEBNIK S. Learning deep structure-preserving image-text embeddings[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 5005-5013. 10.1109/cvpr.2016.541
[22] LIU Y, GUO Y M, BAKKER E M, et al. Learning a recurrent residual fusion network for multimodal matching[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 4127-4136. 10.1109/iccv.2017.442
[23] REN S Q, HE K M, GIRSHICK R. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. 10.1109/tpami.2016.2577031
[24] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086. 10.1109/cvpr.2018.00636
[25] KRISHNA R, ZHU Y, GROTH O, et al. Visual Genome: connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017, 123(1): 32-73. 10.1007/s11263-016-0981-7
[26] VENUGOPALAN S, ROHRBACH M, DONAHUE J, et al. Sequence to sequence - video to text[C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 4534-4542. 10.1109/iccv.2015.515
[27] ZHENG Z D, ZHENG L, GARRETT M, et al. Dual-path convolutional image-text embedding with instance loss[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2020, 16(2): No.51. 10.1145/3383184
[28] HUANG Y, WU Q, SONG C F, et al. Learning semantic concepts and order for image and sentence matching[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6163-6171. 10.1109/cvpr.2018.00645
[29] WANG T, XU X, YANG Y, et al. Matching images and text with multi-modal tensor fusion and re-ranking[C]// Proceedings of the 27th ACM International Conference on Multimedia. New York: ACM, 2019: 12-20. 10.1145/3343031.3350875
[30] KINGMA D P, BA J L. Adam: a method for stochastic optimization[EB/OL]. (2017-01-30) [2021-08-03].