Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (12): 3776-3783. DOI: 10.11772/j.issn.1001-9081.2023121860
• Artificial intelligence •

Image-text retrieval model based on intra-modal fine-grained feature relationship extraction

Zucheng WU, Xiaojun WU, Tianyang XU
Received: 2024-01-10
Revised: 2024-04-25
Accepted: 2024-05-07
Online: 2024-06-07
Published: 2024-12-10
Contact: Xiaojun WU
About author: WU Zucheng, born in 1998 in Suzhou, Jiangsu, M.S. candidate. His research interests include cross-modal retrieval and deep learning.
Zucheng WU, Xiaojun WU, Tianyang XU. Image-text retrieval model based on intra-modal fine-grained feature relationship extraction[J]. Journal of Computer Applications, 2024, 44(12): 3776-3783.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2023121860
| Dataset | Model | Image→Text R@1/% | R@5/% | R@10/% | Text→Image R@1/% | R@5/% | R@10/% | R@sum |
|---|---|---|---|---|---|---|---|---|
| Flickr30K | DPC[24] | 55.6 | 81.9 | 89.5 | 39.1 | 69.2 | 80.9 | 416.2 |
| | SCO[1] | 55.5 | 82.0 | 89.3 | 41.1 | 70.5 | 80.1 | 418.5 |
| | SCAN[3] | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 | 465.0 |
| | CAAN[25] | 70.1 | 91.6 | 97.2 | 52.8 | 79.0 | 87.9 | 478.6 |
| | IMRAM[6] | 74.1 | 93.0 | 96.6 | 53.9 | 79.4 | 87.2 | 484.2 |
| | GSMN[4] | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 | 496.7 |
| | SGRAF[22] | 77.8 | 94.1 | 97.4 | 58.5 | 83.0 | 88.8 | 499.6 |
| | ADAPT[23] | 76.6 | 95.4 | 97.6 | | | 92.0 | |
| | MV[26] | 79.0 | 94.9 | 97.7 | 59.1 | 84.6 | 90.6 | 505.8 |
| | CHAN[27] | | 94.5 | 97.3 | 60.2 | 85.3 | 90.7 | 507.8 |
| | IFRE(Image) | 76.9 | 94.4 | 97.3 | 59.0 | 84.5 | 90.2 | 502.3 |
| | IFRE(Caption) | 78.7 | 95.9 | 98.4 | 59.1 | 84.1 | 89.9 | 506.1 |
| | IFRE(Ensemble) | 80.0 | 97.0 | 98.5 | 61.6 | 86.7 | 91.6 | 515.5 |
| MS-COCO 1K | DPC[24] | 65.6 | 89.8 | 95.5 | 47.1 | 79.9 | 90.0 | 467.9 |
| | SCO[1] | 69.9 | 92.9 | 97.5 | 56.7 | 87.5 | 94.8 | 499.3 |
| | SCAN[3] | 72.7 | 94.8 | 98.4 | 58.8 | 88.4 | 94.8 | 507.9 |
| | CAAN[25] | 75.5 | 95.4 | 98.5 | 61.3 | 89.7 | 95.2 | 515.6 |
| | IMRAM[6] | 76.7 | 95.6 | 98.5 | 61.7 | 89.1 | 95.0 | 516.6 |
| | GSMN[4] | 78.4 | 96.4 | 98.6 | 63.3 | 90.1 | 95.7 | 522.5 |
| | SGRAF[22] | | 96.2 | 98.5 | 63.2 | | | 524.3 |
| | ADAPT[23] | 76.5 | 95.6 | 98.9 | 62.2 | 90.5 | 96.0 | 519.8 |
| | MV[26] | 78.7 | 95.7 | 98.7 | 62.7 | 90.4 | 95.7 | 521.9 |
| | CHAN[27] | 79.7 | 96.7 | 98.7 | | 90.4 | 95.8 | |
| | IFRE(Image) | 75.3 | 96.0 | 98.5 | 61.7 | 90.1 | 95.7 | 517.3 |
| | IFRE(Caption) | 76.8 | 96.2 | | 62.1 | 89.6 | 95.4 | 518.9 |
| | IFRE(Ensemble) | 77.9 | | 98.9 | 64.4 | 91.0 | 96.3 | 525.1 |
| MS-COCO 5K | DPC[24] | 41.2 | 70.5 | 81.1 | 25.3 | 53.4 | 66.4 | 337.9 |
| | SCO[1] | 42.8 | 72.3 | 83.0 | 33.1 | 62.9 | 75.5 | 369.6 |
| | SCAN[3] | 50.4 | 82.2 | 90.0 | 38.6 | 69.3 | 80.4 | 410.9 |
| | IMRAM[6] | 53.7 | 83.2 | 91.0 | 39.7 | 69.1 | 79.8 | 416.5 |
| | MV[26] | | | | | | | |
| | IFRE(Image) | 53.4 | 82.6 | 90.2 | 39.8 | 69.7 | 80.7 | 416.4 |
| | IFRE(Caption) | 54.7 | 83.0 | 90.9 | 39.9 | 69.0 | 80.0 | 417.6 |
| | IFRE(Ensemble) | 57.1 | 84.5 | 91.8 | 42.1 | 71.2 | 81.7 | 428.4 |
Tab. 1 Comparison of quantitative results of different models on three datasets
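R@K denotes the percentage of queries whose ground-truth match appears among the top-K retrieved results, and R@sum is simply the sum of the six recall values in a row; for example, DPC on Flickr30K gives 55.6 + 81.9 + 89.5 + 39.1 + 69.2 + 80.9 = 416.2. A minimal sketch of this aggregation (function and variable names are illustrative, not taken from the paper):

```python
# Minimal sketch: R@sum as reported in Tab. 1 is the sum of the six R@K values
# (image-to-text R@1/5/10 plus text-to-image R@1/5/10, all in percent).
# Function and variable names are illustrative, not taken from the paper.

def r_sum(i2t_recalls, t2i_recalls):
    """Aggregate the six recall values (in %) of one model into R@sum."""
    assert len(i2t_recalls) == 3 and len(t2i_recalls) == 3
    return sum(i2t_recalls) + sum(t2i_recalls)

# Example: DPC on Flickr30K, values taken from Tab. 1.
print(round(r_sum([55.6, 81.9, 89.5], [39.1, 69.2, 80.9]), 1))  # -> 416.2
```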
| Method | Image→Text R@1/% | R@5/% | R@10/% | Text→Image R@1/% | R@5/% | R@10/% | R@sum |
|---|---|---|---|---|---|---|---|
| IFRE(Image) | 76.9 | 94.4 | 97.3 | 59.0 | 84.5 | 90.2 | 502.3 |
| w/o IF | 75.4 | 94.0 | 97.2 | 57.7 | 83.7 | 89.5 | 497.6 |
| w/o PG | 76.3 | 94.9 | 97.3 | 58.0 | 83.7 | 89.9 | 500.2 |
| w/o SG | 76.6 | 94.3 | 97.2 | 58.1 | 84.1 | 89.7 | 500.0 |
| w/o AL | 76.5 | 94.9 | 97.4 | 57.1 | 83.5 | 89.7 | 499.0 |
| IFRE(Caption) | 78.7 | 95.9 | 98.4 | 59.1 | 84.1 | 89.9 | 506.1 |
| w/o IF | 76.2 | 95.8 | 97.9 | 57.7 | 83.2 | 89.8 | 500.6 |
| w/o AL | 77.8 | 94.9 | 97.5 | 59.2 | 84.1 | 90.1 | 503.6 |
| IFRE(Ensemble) | 80.0 | 97.0 | 98.5 | 61.6 | 86.7 | 91.6 | 515.5 |
| w/o Ensemble | 76.9 | 94.6 | 98.0 | 58.5 | 83.6 | 89.8 | 501.4 |
Tab. 2 Ablation experiment results of IFRE model on Flickr30K dataset
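The last two rows contrast the ensemble of the two branches with a single branch. Tab. 2 does not show how IFRE(Image) and IFRE(Caption) are fused, so the sketch below only illustrates one common fusion choice for two-branch retrieval models, averaging their image-text similarity matrices before ranking; the averaging rule and all names here are assumptions for illustration, not the paper's method.

```python
import numpy as np

# Hedged sketch: evaluate an ensemble of two retrieval branches by averaging their
# image-text similarity matrices before ranking. The averaging rule and all names
# are assumptions for illustration; Tab. 2 does not specify IFRE's fusion scheme.

def recall_at_k(sim, gt_caption, k):
    """Percentage of image queries whose ground-truth caption ranks in the top k."""
    order = np.argsort(-sim, axis=1)               # captions sorted by decreasing score
    hits = [gt_caption[i] in order[i, :k] for i in range(sim.shape[0])]
    return 100.0 * np.mean(hits)

def ensemble(sim_image_branch, sim_caption_branch):
    """Average the similarity matrices of the two branches (assumed fusion rule)."""
    return (sim_image_branch + sim_caption_branch) / 2.0

# Toy example with random scores; real use would plug in the two trained branches'
# (n_images, n_captions) similarity matrices, e.g. 5 captions per image as in Flickr30K.
rng = np.random.default_rng(0)
sim_a = rng.random((4, 20))
sim_b = rng.random((4, 20))
gt = np.arange(4) * 5                              # index of each image's first caption
print(recall_at_k(ensemble(sim_a, sim_b), gt, k=10))
```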
1 | HUANG Y, WU Q, SONG C, et al. Learning semantic concepts and order for image and sentence matching [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6163-6171. |
2 | GRAVES A. Long short-term memory[M]// Supervised sequence labelling with recurrent neural networks, SCI 385. Berlin: Springer, 2012: 37-45. |
3 | LEE K H, CHEN X, HUA G, et al. Stacked cross attention for image-text matching[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11208. Cham: Springer, 2018: 212-228. |
4 | LIU C, MAO Z, ZHANG T, et al. Graph structured network for image-text matching[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10918-10927. |
5 | WEI X, ZHANG T, LI Y, et al. Multi-modality cross attention network for image and sentence matching [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10938-10947. |
6 | CHEN H, DING G, LIU X, et al. IMRAM: iterative matching with recurrent attention memory for cross-modal image-text retrieval [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 12652-12660. |
7 | EISENSCHTAT A, WOLF L. Linking image and text with 2-way nets[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1855-1865. |
8 | GU J, CAI J, JOTY S, et al. Look, imagine and match: improving textual-visual cross-modal retrieval with generative models[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7181-7189. |
9 | LIU Y, GUO Y, BAKKER E M, et al. Learning a recurrent residual fusion network for multimodal matching [C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 4127-4136. |
10 | WANG L, LI Y, HUANG J, et al. Learning two-branch neural networks for image-text matching tasks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(2): 394-407. |
11 | WANG L, LI Y, LAZEBNIK S. Learning deep structure-preserving image-text embeddings [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 5005-5013. |
12 | SONG Y, SOLEYMANI M. Polysemous visual-semantic embedding for cross-modal retrieval [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 1979-1988. |
13 | HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. |
14 | REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. |
15 | ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086. |
16 | KRISHNA R, ZHU Y, GROTH O, et al. Visual Genome: connecting language and vision using crowdsourced dense image annotations [J]. International Journal of Computer Vision, 2017, 123(1): 32-73. |
17 | PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014: 1532-1543. |
18 | FAGHRI F, FLEET D J, KIROS J R, et al. VSE++: improving visual-semantic embeddings with hard negatives [C]// Proceedings of the 2018 British Machine Vision Conference. Durham: BMVA Press, 2018: 1-14. |
19 | KIROS R, SALAKHUTDINOV R, ZEMEL R S. Unifying visual-semantic embeddings with multimodal neural language models [EB/OL]. [2023-12-05]. |
20 | LIU C, MAO Z, LIU A A, et al. Focus your attention: a bidirectional focal attention network for image-text matching [C]// Proceedings of the 27th ACM International Conference on Multimedia. New York: ACM, 2019: 3-11. |
21 | DENG Y J, ZHANG F L, CHEN X Q, et al. Collaborative attention network model for cross-modal retrieval [J]. Computer Science, 2020, 47(4): 54-59. |
22 | DIAO H, ZHANG Y, MA L, et al. Similarity reasoning and filtration for image-text matching[C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 1218-1226. |
23 | WEHRMANN J, KOLLING C, BARROS R C. Adaptive cross-modal embeddings for image-text alignment[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 12313-12320. |
24 | ZHENG Z, ZHENG L, GARRETT M, et al. Dual-path convolutional image-text embeddings with instance loss [J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2020, 16(2): No.51. |
25 | ZHANG Q, LEI Z, ZHANG Z, et al. Context-aware attention network for image-text retrieval[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 3533-3542. |
26 | LI Z, GUO C, FENG Z, et al. Multi-view visual semantic embedding[C]// Proceedings of the 31st International Joint Conference on Artificial Intelligence. California: ijcai.org, 2022: 1130-1136. |
27 | PAN Z, WU F, ZHANG B. Fine-grained image-text matching by cross-modal hard aligning network[C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 19275-19284. |