Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (12): 3776-3783. DOI: 10.11772/j.issn.1001-9081.2023121860
Zucheng WU, Xiaojun WU, Tianyang XU
Received: 2024-01-10
Revised: 2024-04-25
Accepted: 2024-05-07
Online: 2024-06-07
Published: 2024-12-10
Contact: Xiaojun WU
About author: WU Zucheng, born in 1998 in Suzhou, Jiangsu, M.S. candidate. His research interests include cross-modal retrieval and deep learning.
Abstract: In cross-modal retrieval, the relationships between objects are diverse, and the traditional appearance-based paradigm cannot accurately reflect the associations among salient objects in an image, which limits its effectiveness in complex scenes. To address these problems, an image-text retrieval model based on intra-modal fine-grained feature relationship extraction was proposed. Firstly, to obtain more intuitive position information, the image was divided into grids, and a position representation was built from the positional relations between objects and grid cells. Secondly, to keep node information stable and independent during relationship modeling, a feature fusion module guided by cross-modal information was used. Finally, an adaptive triplet loss was proposed to dynamically balance the training weights of positive and negative samples. Experimental results show that, compared with CHAN (Cross-modal Hard Aligning Network), the proposed model improves the R@sum metric (the sum of recalls at top 1, 5 and 10 for image-to-text and text-to-image retrieval) by 1.5% and 0.02% on the Flickr30K and MS-COCO 1K datasets respectively, which validates the effectiveness of the proposed model in terms of retrieval recall.
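The exact form of the adaptive triplet loss is not given on this page, so the following is only an illustrative Python sketch: a conventional hinge-based triplet ranking loss over an image-text similarity matrix (in the style of VSE++), with a hypothetical softmax weighting of violating negatives standing in for the dynamic positive/negative balancing described above. The function name and the weighting scheme are assumptions, not the paper's implementation.

```python
# Illustrative sketch only; the weighting scheme is an assumption, not the paper's loss.
import torch

def adaptive_triplet_loss(sim: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """sim: (B, B) image-text similarity matrix; diagonal entries are matched pairs."""
    pos = sim.diag().view(-1, 1)                       # similarity of each matched pair
    cost_s = (margin + sim - pos).clamp(min=0)         # image -> negative captions
    cost_im = (margin + sim - pos.t()).clamp(min=0)    # caption -> negative images
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)
    # Hypothetical adaptive weighting: negatives with larger margin violations
    # receive larger weights, down-weighting easy negatives.
    w_s = torch.softmax(cost_s, dim=1).detach()
    w_im = torch.softmax(cost_im, dim=0).detach()
    return (w_s * cost_s).sum() + (w_im * cost_im).sum()
```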
Zucheng WU, Xiaojun WU, Tianyang XU. Image-text retrieval model based on intra-modal fine-grained feature relationship extraction[J]. Journal of Computer Applications, 2024, 44(12): 3776-3783.
| Dataset | Model | Image-to-text R@1/% | Image-to-text R@5/% | Image-to-text R@10/% | Text-to-image R@1/% | Text-to-image R@5/% | Text-to-image R@10/% | R@sum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Flickr30K | DPC | 55.6 | 81.9 | 89.5 | 39.1 | 69.2 | 80.9 | 416.2 |
| | SCO | 55.5 | 82.0 | 89.3 | 41.1 | 70.5 | 80.1 | 418.5 |
| | SCAN | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 | 465.0 |
| | CAAN | 70.1 | 91.6 | 97.2 | 52.8 | 79.0 | 87.9 | 478.6 |
| | IMRAM | 74.1 | 93.0 | 96.6 | 53.9 | 79.4 | 87.2 | 484.2 |
| | GSMN | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 | 496.7 |
| | SGRAF | 77.8 | 94.1 | 97.4 | 58.5 | 83.0 | 88.8 | 499.6 |
| | ADAPT | 76.6 | 95.4 | 97.6 | | | 92.0 | |
| | MV | 79.0 | 94.9 | 97.7 | 59.1 | 84.6 | 90.6 | 505.8 |
| | CHAN | | 94.5 | 97.3 | 60.2 | 85.3 | 90.7 | 507.8 |
| | IFRE(Image) | 76.9 | 94.4 | 97.3 | 59.0 | 84.5 | 90.2 | 502.3 |
| | IFRE(Caption) | 78.7 | 95.9 | 98.4 | 59.1 | 84.1 | 89.9 | 506.1 |
| | IFRE(Ensemble) | 80.0 | 97.0 | 98.5 | 61.6 | 86.7 | 91.6 | 515.5 |
| MS-COCO 1K | DPC | 65.6 | 89.8 | 95.5 | 47.1 | 79.9 | 90.0 | 467.9 |
| | SCO | 69.9 | 92.9 | 97.5 | 56.7 | 87.5 | 94.8 | 499.3 |
| | SCAN | 72.7 | 94.8 | 98.4 | 58.8 | 88.4 | 94.8 | 507.9 |
| | CAAN | 75.5 | 95.4 | 98.5 | 61.3 | 89.7 | 95.2 | 515.6 |
| | IMRAM | 76.7 | 95.6 | 98.5 | 61.7 | 89.1 | 95.0 | 516.6 |
| | GSMN | 78.4 | 96.4 | 98.6 | 63.3 | 90.1 | 95.7 | 522.5 |
| | SGRAF | | 96.2 | 98.5 | 63.2 | | | 524.3 |
| | ADAPT | 76.5 | 95.6 | 98.9 | 62.2 | 90.5 | 96.0 | 519.8 |
| | MV | 78.7 | 95.7 | 98.7 | 62.7 | 90.4 | 95.7 | 521.9 |
| | CHAN | 79.7 | 96.7 | 98.7 | | 90.4 | 95.8 | |
| | IFRE(Image) | 75.3 | 96.0 | 98.5 | 61.7 | 90.1 | 95.7 | 517.3 |
| | IFRE(Caption) | 76.8 | 96.2 | | 62.1 | 89.6 | 95.4 | 518.9 |
| | IFRE(Ensemble) | 77.9 | | 98.9 | 64.4 | 91.0 | 96.3 | 525.1 |
| MS-COCO 5K | DPC | 41.2 | 70.5 | 81.1 | 25.3 | 53.4 | 66.4 | 337.9 |
| | SCO | 42.8 | 72.3 | 83.0 | 33.1 | 62.9 | 75.5 | 369.6 |
| | SCAN | 50.4 | 82.2 | 90.0 | 38.6 | 69.3 | 80.4 | 410.9 |
| | IMRAM | 53.7 | 83.2 | 91.0 | 39.7 | 69.1 | 79.8 | 416.5 |
| | MV | | | | | | | |
| | IFRE(Image) | 53.4 | 82.6 | 90.2 | 39.8 | 69.7 | 80.7 | 416.4 |
| | IFRE(Caption) | 54.7 | 83.0 | 90.9 | 39.9 | 69.0 | 80.0 | 417.6 |
| | IFRE(Ensemble) | 57.1 | 84.5 | 91.8 | 42.1 | 71.2 | 81.7 | 428.4 |
Tab. 1 Comparison of quantitative results of different models on three datasets
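As a quick check of the figures quoted in the abstract, the sketch below (not from the paper's code) recomputes R@sum as the sum of the six recall values and the relative gain of IFRE(Ensemble) over CHAN on Flickr30K using the Table 1 entries; small rounding differences are expected because the published sums are computed from unrounded recalls.

```python
# Illustrative only: values copied from the Flickr30K rows of Table 1.
def r_sum(recalls):
    """R@sum = R@1 + R@5 + R@10 for image-to-text plus the same for text-to-image."""
    return sum(recalls)

ifre_ensemble = [80.0, 97.0, 98.5, 61.6, 86.7, 91.6]    # IFRE(Ensemble), Flickr30K
print(round(r_sum(ifre_ensemble), 1))                   # 515.4 vs. 515.5 listed (rounding)

chan_rsum, ifre_rsum = 507.8, 515.5                     # R@sum column, Flickr30K
gain = (ifre_rsum - chan_rsum) / chan_rsum * 100
print(f"{gain:.2f}%")                                   # ~1.52%, i.e. the ~1.5% quoted in the abstract
```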
| Method | Image-to-text R@1/% | Image-to-text R@5/% | Image-to-text R@10/% | Text-to-image R@1/% | Text-to-image R@5/% | Text-to-image R@10/% | R@sum |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IFRE(Image) | 76.9 | 94.4 | 97.3 | 59.0 | 84.5 | 90.2 | 502.3 |
| w/o IF | 75.4 | 94.0 | 97.2 | 57.7 | 83.7 | 89.5 | 497.6 |
| w/o PG | 76.3 | 94.9 | 97.3 | 58.0 | 83.7 | 89.9 | 500.2 |
| w/o SG | 76.6 | 94.3 | 97.2 | 58.1 | 84.1 | 89.7 | 500.0 |
| w/o AL | 76.5 | 94.9 | 97.4 | 57.1 | 83.5 | 89.7 | 499.0 |
| IFRE(Caption) | 78.7 | 95.9 | 98.4 | 59.1 | 84.1 | 89.9 | 506.1 |
| w/o IF | 76.2 | 95.8 | 97.9 | 57.7 | 83.2 | 89.8 | 500.6 |
| w/o AL | 77.8 | 94.9 | 97.5 | 59.2 | 84.1 | 90.1 | 503.6 |
| IFRE(Ensemble) | 80.0 | 97.0 | 98.5 | 61.6 | 86.7 | 91.6 | 515.5 |
| w/o Ensemble | 76.9 | 94.6 | 98.0 | 58.5 | 83.6 | 89.8 | 501.4 |
Tab. 2 Ablation experiment results of IFRE model on Flickr30K dataset
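The IFRE(Ensemble) rows in Tables 1 and 2 combine the Image and Caption variants, but this page does not state how the combination is done. The sketch below assumes the common SGRAF-style practice of averaging the two branches' similarity matrices before ranking; read it as an assumption, not the paper's actual procedure.

```python
# Hypothetical ensembling sketch (assumed, not from the paper).
import numpy as np

def ensemble_similarity(sim_image: np.ndarray, sim_caption: np.ndarray) -> np.ndarray:
    """Average the image-branch and caption-branch image-text similarity matrices."""
    return (sim_image + sim_caption) / 2.0

# Toy usage with random scores for 5 images x 5 captions (illustration only).
rng = np.random.default_rng(0)
sim_a, sim_b = rng.random((5, 5)), rng.random((5, 5))
ranks = np.argsort(-ensemble_similarity(sim_a, sim_b), axis=1)  # per-image caption ranking
```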
1 HUANG Y, WU Q, SONG C, et al. Learning semantic concepts and order for image and sentence matching[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6163-6171.
2 GRAVES A. Long short-term memory[M]// Supervised Sequence Labelling with Recurrent Neural Networks, SCI 385. Berlin: Springer, 2012: 37-45.
3 LEE K H, CHEN X, HUA G, et al. Stacked cross attention for image-text matching[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11208. Cham: Springer, 2018: 212-228.
4 LIU C, MAO Z, ZHANG T, et al. Graph structured network for image-text matching[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10918-10927.
5 WEI X, ZHANG T, LI Y, et al. Multi-modality cross attention network for image and sentence matching[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10938-10947.
6 CHEN H, DING G, LIU X, et al. IMRAM: iterative matching with recurrent attention memory for cross-modal image-text retrieval[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 12652-12660.
7 EISENSCHTAT A, WOLF L. Linking image and text with 2-way nets[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1855-1865.
8 GU J, CAI J, JOTY S, et al. Look, imagine and match: improving textual-visual cross-modal retrieval with generative models[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7181-7189.
9 LIU Y, GUO Y, BAKKER E M, et al. Learning a recurrent residual fusion network for multimodal matching[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 4127-4136.
10 WANG L, LI Y, HUANG J, et al. Learning two-branch neural networks for image-text matching tasks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(2): 394-407.
11 WANG L, LI Y, LAZEBNIK S. Learning deep structure-preserving image-text embeddings[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 5005-5013.
12 SONG Y, SOLEYMANI M. Polysemous visual-semantic embedding for cross-modal retrieval[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 1979-1988.
13 HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
14 REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
15 ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086.
16 KRISHNA R, ZHU Y, GROTH O, et al. Visual Genome: connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017, 123(1): 32-73.
17 PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014: 1532-1543.
18 FAGHRI F, FLEET D J, KIROS J R, et al. VSE++: improving visual-semantic embeddings with hard negatives[C]// Proceedings of the 2018 British Machine Vision Conference. Durham: BMVA Press, 2018: 1-14.
19 KIROS R, SALAKHUTDINOV R, ZEMEL R S. Unifying visual-semantic embeddings with multimodal neural language models[EB/OL]. [2023-12-05].
20 LIU C, MAO Z, LIU A A, et al. Focus your attention: a bidirectional focal attention network for image-text matching[C]// Proceedings of the 27th ACM International Conference on Multimedia. New York: ACM, 2019: 3-11.
21 DENG Y J, ZHANG F L, CHEN X Q, et al. Collaborative attention network model for cross-modal retrieval[J]. Computer Science, 2020, 47(4): 54-59.
22 DIAO H, ZHANG Y, MA L, et al. Similarity reasoning and filtration for image-text matching[C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 1218-1226.
23 WEHRMANN J, KOLLING C, BARROS R C. Adaptive cross-modal embeddings for image-text alignment[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 12313-12320.
24 ZHENG Z, ZHENG L, GARRETT M, et al. Dual-path convolutional image-text embeddings with instance loss[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2020, 16(2): No.51.
25 ZHANG Q, LEI Z, ZHANG Z, et al. Context-aware attention network for image-text retrieval[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 3533-3542.
26 LI Z, GUO C, FENG Z, et al. Multi-view visual semantic embedding[C]// Proceedings of the 31st International Joint Conference on Artificial Intelligence. California: ijcai.org, 2022: 1130-1136.
27 PAN Z, WU F, ZHANG B. Fine-grained image-text matching by cross-modal hard aligning network[C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 19275-19284.