Deep bi-modal source domain symmetrical transfer learning for cross-modal retrieval

Journal of Computer Applications, 2024, Vol. 44, Issue (1): 24-31. DOI: 10.11772/j.issn.1001-9081.2023010047

Section: Cross-media Representation Learning and Cognitive Reasoning

Qiujie LIU, Yuan WAN (corresponding author), Jie WU

School of Science, Wuhan University of Technology, Wuhan, Hubei 430070, China

Received: 2023-01-17  Revised: 2023-05-11  Accepted: 2023-05-12  Online: 2023-06-06  Published: 2024-01-10
Contact: Yuan WAN
About the authors:
LIU Qiujie, born in 1999 in Zhumadian, Henan, M. S. candidate. His research interests include machine learning and pattern recognition.
WAN Yuan, born in 1976 in Wuhan, Hubei, Ph. D., professor, CCF member. Her research interests include machine learning, pattern recognition and image processing.
WU Jie, born in 1999 in Nanchang, Jiangxi, M. S. candidate. His research interests include machine learning and pattern recognition.
Supported by: Fundamental Research Funds for the Central Universities (2021III030JC)

Abstract:

Cross-modal retrieval based on deep networks often faces the challenge of insufficient cross-modal training data, which limits training effectiveness and easily leads to over-fitting. Transfer learning, which learns from the training data in a source domain and transfers the acquired knowledge to a target domain, is an effective way to address this shortage. However, most existing transfer learning methods focus on transferring knowledge from a single-modal (such as image) source domain to a multi-modal (such as image and text) target domain. When the source domain already contains multiple modalities, such asymmetric transfer ignores the potential inter-modal semantic information in the source domain. These methods also fail to capture the similarity between the same modalities in the source and target domains well, and therefore do not reduce the domain discrepancy. To address these problems, a Deep Bi-modal source domain Symmetrical Transfer Learning for cross-modal retrieval (DBSTL) method is proposed. The method aims to transfer knowledge from a bi-modal source domain to a cross-modal target domain and to obtain a common representation of cross-modal data. DBSTL consists of a modal symmetric transfer subnet and a semantic consistency learning subnet. The modal symmetric transfer subnet adopts a hybrid symmetric structure, which makes the inter-modal information more consistent during knowledge transfer and reduces the discrepancy between the source and target domains. In the semantic consistency learning subnet, all modalities share the same common representation layer, and cross-modal semantic consistency is ensured under the guidance of the supervision information of the target domain.

Experimental results show that on the Pascal, NUS-WIDE-10k and Wikipedia datasets, the mean Average Precision (mAP) of the proposed method is improved by about 8.4, 0.4 and 1.2 percentage points respectively, compared with the best results obtained by the comparison methods. DBSTL makes full use of the potential information of the bi-modal source domain for symmetric transfer learning, ensures semantic consistency between modalities under the guidance of the supervision information, and improves the similarity of the image and text distributions in the common representation space.

Key words: cross-modal retrieval, transfer learning, bi-modal source domain, semantic consistency
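
The abstract reports results as mean Average Precision (mAP). As a rough illustration of how that metric is commonly computed for cross-modal retrieval, the following is a minimal sketch and not the authors' evaluation code. It assumes that queries and gallery items are already embedded in a common representation space, that ranking uses cosine similarity, that a retrieved item counts as relevant when its class label matches the query's label, and that the whole gallery is ranked. The function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_precision(query_vec, query_label, gallery_vecs, gallery_labels):
    """Average precision of one query against a gallery from the other modality."""
    # Rank gallery items by descending similarity to the query.
    order = sorted(
        range(len(gallery_vecs)),
        key=lambda i: cosine(query_vec, gallery_vecs[i]),
        reverse=True,
    )
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if gallery_labels[idx] == query_label:  # assumed relevance rule: same class label
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / hits if hits else 0.0


def mean_average_precision(query_vecs, query_labels, gallery_vecs, gallery_labels):
    """Mean of the per-query average precision values."""
    aps = [
        average_precision(q, label, gallery_vecs, gallery_labels)
        for q, label in zip(query_vecs, query_labels)
    ]
    return sum(aps) / len(aps)


# Toy image -> text example with hypothetical 2-D common representations.
image_queries = [[1.0, 0.0], [0.0, 1.0]]
query_labels = [0, 1]
text_gallery = [[0.9, 0.1], [0.1, 0.9], [0.95, 0.05]]
gallery_labels = [0, 1, 1]

print(mean_average_precision(image_queries, query_labels, text_gallery, gallery_labels))
```

In this toy case the third text item sits close to the first image query but carries a different label, so that query's average precision drops to 0.5 and the overall mAP to about 0.667. Papers differ on details such as ranking only the top-k results, so the exact protocol behind the reported numbers may not match this sketch.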

CLC number: TP391.3

Cite this article:
Qiujie LIU, Yuan WAN, Jie WU. Deep bi-modal source domain symmetrical transfer learning for cross-modal retrieval[J]. Journal of Computer Applications, 2024, 44(1): 24-31.



Table: mAP of the comparison methods and the proposed method (DBSTL) on the three datasets

Dataset	Task	CCA	CFA	KCCA	CMDN	Deep-SM	DSCMR	CHTN	DCKT	MHTN	DMTL	Proposed (DBSTL)
Pascal	Image→Text	0.110	0.341	0.271	0.458	0.440	0.710	0.467	0.582	0.496	0.632	0.712
	Text→Image	0.116	0.308	0.280	0.444	0.414	0.722	0.477	0.587	0.500	0.637	0.729
	Average	0.113	0.325	0.276	0.451	0.427	0.716	0.472	0.585	0.498	0.634	0.718
NUS-WIDE-10k	Image→Text	0.159	0.299	0.129	0.410	0.389	0.611	0.518	0.556	0.520	0.656	0.640
	Text→Image	0.189	0.301	0.157	0.450	0.496	0.615	0.516	0.584	0.534	0.634	0.656
	Average	0.174	0.300	0.143	0.430	0.443	0.613	0.517	0.570	0.527	0.645	0.649
Wikipedia	Image→Text	0.176	0.330	0.230	0.409	0.458	0.521	0.508	0.537	0.514	0.531	0.570
	Text→Image	0.178	0.306	0.224	0.364	0.345	0.478	0.432	0.485	0.444	0.574	0.505
	Average	0.177	0.318	0.227	0.387	0.402	0.499	0.470	0.511	0.479	0.552	0.564
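
To relate this table to the gains quoted in the abstract, the short sketch below recomputes the improvement of the proposed method in percentage points from the average-mAP rows. The numbers are copied from the table as it appears on this page; the variable and function names are hypothetical. With these values, the gains of about 8.4, 0.4 and 1.2 points coincide with the differences relative to DMTL, while on Pascal the highest comparison average in the table is DSCMR at 0.716.

```python
# Average-mAP rows copied from the comparison table; the last entry is the proposed DBSTL.
METHODS = ["CCA", "CFA", "KCCA", "CMDN", "Deep-SM", "DSCMR",
           "CHTN", "DCKT", "MHTN", "DMTL", "DBSTL"]
AVERAGE_MAP = {
    "Pascal":       [0.113, 0.325, 0.276, 0.451, 0.427, 0.716, 0.472, 0.585, 0.498, 0.634, 0.718],
    "NUS-WIDE-10k": [0.174, 0.300, 0.143, 0.430, 0.443, 0.613, 0.517, 0.570, 0.527, 0.645, 0.649],
    "Wikipedia":    [0.177, 0.318, 0.227, 0.387, 0.402, 0.499, 0.470, 0.511, 0.479, 0.552, 0.564],
}


def gain_in_points(dataset, baseline):
    """Improvement of DBSTL over one baseline, in percentage points of average mAP."""
    scores = dict(zip(METHODS, AVERAGE_MAP[dataset]))
    return round((scores["DBSTL"] - scores[baseline]) * 100, 1)


def best_baseline(dataset):
    """Name and average mAP of the strongest comparison method (DBSTL excluded)."""
    scores = dict(zip(METHODS, AVERAGE_MAP[dataset]))
    del scores["DBSTL"]
    name = max(scores, key=scores.get)
    return name, scores[name]


for dataset in AVERAGE_MAP:
    name, value = best_baseline(dataset)
    print(dataset,
          "| gain over DMTL:", gain_in_points(dataset, "DMTL"),
          "| best baseline:", name, value,
          "| gain over best baseline:", gain_in_points(dataset, name))
```

Working in percentage points (difference of mAP values times 100) rather than relative percent mirrors the wording of the abstract.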


Table: mAP of DBSTL and its variants DBSTL1 to DBSTL4 on the three datasets

Method	Pascal			NUS-WIDE-10k			Wikipedia
	Image→Text	Text→Image	Average	Image→Text	Text→Image	Average	Image→Text	Text→Image	Average
DBSTL1	0.611	0.630	0.627	0.549	0.537	0.548	0.443	0.450	0.448
DBSTL2	0.691	0.701	0.694	0.529	0.540	0.538	0.420	0.410	0.413
DBSTL3	0.590	0.610	0.607	0.570	0.535	0.565	0.501	0.490	0.494
DBSTL4	0.701	0.720	0.715	0.619	0.603	0.611	0.492	0.509	0.503
DBSTL	0.712	0.729	0.718	0.640	0.656	0.649	0.570	0.505	0.564


