| 1 | 刘颖,郭莹莹,房杰,等.深度学习跨模态图文检索研究综述[J].计算机科学与探索, 2022, 16(3): 489-511.  10.3778/j.issn.1673-9418.2107076 | 
																													
																							|  | LIU Y, GUO Y Y, FANG J, et al. Survey of research on deep learning image-text cross-modal retrieval [J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(3): 489-511.  10.3778/j.issn.1673-9418.2107076 | 
																													
																							| 2 | LI X, WANG Y, SHA Z. Deep learning methods of cross-modal tasks for conceptual design of product shapes: a review [J]. Journal of Mechanical Design, 2023, 145(4): 041401.  10.1115/1.4056436 | 
																													
																							| 3 | 刘长红,曾胜,张斌,等.基于语义关系图的跨模态张量融合网络的图像文本检索[J].计算机应用, 2022, 42(10): 3018-3024.  10.11772/j.issn.1001-9081.2021091622 | 
																													
																							|  | LIU C H, ZENG S, ZHANG B, et al. Cross-modal tensor fusion network based on semantic relation graph for image-text retrieval [J]. Journal of Computer Applications, 2022, 42(10): 3018-3024.  10.11772/j.issn.1001-9081.2021091622 | 
																													
																							| 4 | 李志欣,凌锋,张灿龙,等.融合两级相似度的跨媒体图像文本检索[J].电子学报, 2021, 49(2): 268-274.  10.12263/DZXB.20191037 | 
																													
																							|  | LI Z X, LING F, ZHANG C L, et al. Cross-media image-text retrieval with two level similarity [J]. Acta Electronica Sinica, 2021, 49(2): 268-274.  10.12263/DZXB.20191037 | 
																													
																							| 5 | FROME A, CORRADO G S, SHLENS J, et al. DeViSE: a deep visual-semantic embedding model [C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2013: 2121-2129. | 
																													
																							| 6 | FAGHRI F, FLEET D J, KIROS J R, et al. VSE++: improving visual-semantic embeddings with hard negatives [C]// Proceedings of the 2018 British Machine Vision Conference. Durham: BMVA Press, 2018: No.344. | 
																													
																							| 7 | GU J, CAI J, JOTY S R, et al. Look, imagine and match: improving textual-visual cross-modal retrieval with generative models [C]// Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7181-7189.  10.1109/cvpr.2018.00750 | 
																													
																							| 8 | ZHEN L, HU P, WANG X, et al. Deep supervised cross-modal retrieval [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 10386-10395.  10.1109/cvpr.2019.01064 | 
																													
																							| 9 | WEN K, GU X, CHENG Q. Learning dual semantic relations with graph attention for image-text matching [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(7): 2866-2879.  10.1109/tcsvt.2020.3030656 | 
																													
																							| 10 | CHEN J, HU H, WU H, et al. Learning the best pooling strategy for visual semantic embedding [C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 15784-15793.  10.1109/cvpr46437.2021.01553 | 
																													
																							| 11 | KARPATHY A, JOULIN A, LI F-F. Deep fragment embeddings for bidirectional image sentence mapping [C]// Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2014: 1889-1897. | 
																													
																							| 12 | NIU Z, ZHOU M, WANG L, et al. Hierarchical multimodal LSTM for dense visual-semantic embedding [C]// Proceedings of the 2017 IEEE International Conference on computer Vision. Piscataway: IEEE, 2017: 1899-1907.  10.1109/iccv.2017.208 | 
																													
																							| 13 | NAM H, J-W HA, KIM J. Dual attention networks for multimodal reasoning and matching [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 2156-2164.  10.1109/cvpr.2017.232 | 
																													
																							| 14 | LEE K-H, CHEN X, HUA G, et al. Stacked cross attention for image-text matching [C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 212-228.  10.1007/978-3-030-01225-0_13 | 
																													
																							| 15 | CHEN H, DING G, LIU X, et al. IMRAM: iterative matching with recurrent attention memory for cross-modal image-text retrieval [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 12652-12660.  10.1109/cvpr42600.2020.01267 | 
																													
																							| 16 | QU L, LIU M, WU J, et al. Dynamic modality interaction modeling for image-text retrieval [C]// Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2021: 1104-1113.  10.1145/3404835.3462829 | 
																													
																							| 17 | JI Z, CHEN K, WANG H. Step-wise hierarchical alignment network for image-text matching [EB/OL]. [2021-01-11]. .  10.24963/ijcai.2021/106 | 
																													
																							| 18 | CHEN R, WANG H, WANG L, et al. Two-stream hierarchical similarity reasoning for image-text matching [EB/OL]. [2022-03-10]. . | 
																													
																							| 19 | ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086.  10.1109/cvpr.2018.00636 | 
																													
																							| 20 | KRISHNA R, ZHU Y, GROTH O, et al. Visual Genome: connecting language and vision using crowdsourced dense image annotations [J]. International Journal of Computer Vision, 2017, 123(1): 32-73.  10.1007/s11263-016-0981-7 | 
																													
																							| 21 | REN S, HE K, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.  10.1109/tpami.2016.2577031 | 
																													
																							| 22 | HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.  10.1109/cvpr.2016.90 | 
																													
																							| 23 | DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database [C]// Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2009: 248-255.  10.1109/cvpr.2009.5206848 | 
																													
																							| 24 | SCHUSTER M, PALIWAL K K. Bidirectional recurrent neural networks [J]. IEEE Transactions on Signal Processing, 1997, 45(11): 2673-2681.  10.1109/78.650093 | 
																													
																							| 25 | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010. | 
																													
																							| 26 | PLUMMER B A, WANG L, CERVANTES C M, et al. Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models [C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 2641-2649.  10.1109/iccv.2015.303 | 
																													
																							| 27 | VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 652-663.  10.1109/tpami.2016.2587640 | 
																													
																							| 28 | JIANG Z, LIAN Z. Mutil-level local alignment and semantic matching network for image-text retrieval [C]// Proceedings of the 2022 International Conference on Artificial Neural Networks. Cham: Springer, 2022: 212-224.  10.1007/978-3-031-15934-3_18 | 
																													
																							| 29 | KINGMA D P, BA J. Adam: a method for stochastic optimization [EB/OL]. (2017-01-30) [2021-08-03]. . |