Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 24-31. DOI: 10.11772/j.issn.1001-9081.2023010047

• Cross-media Representation Learning and Cognitive Reasoning •

Deep bi-modal source domain symmetrical transfer learning for cross-modal retrieval

Qiujie LIU, Yuan WAN, Jie WU

  1. School of Science, Wuhan University of Technology, Wuhan Hubei 430070, China
  • Received: 2023-01-17 Revised: 2023-05-11 Accepted: 2023-05-12 Online: 2023-06-06 Published: 2024-01-10
  • Contact: Yuan WAN
  • About author: LIU Qiujie, born in 1999, M. S. candidate. His research interests include machine learning and pattern recognition.
    WAN Yuan, born in 1976, Ph. D., professor, CCF member. Her research interests include machine learning, pattern recognition and image processing.
    WU Jie, born in 1999, M. S. candidate. His research interests include machine learning and pattern recognition.
  • Supported by:
    Fundamental Research Funds for the Central Universities (2021III030JC)


Abstract:

Cross-modal retrieval based on deep networks often faces the challenge of insufficient cross-modal training data, which limits the training effect and easily leads to over-fitting. Transfer learning, which learns from the training data in a source domain and transfers the acquired knowledge to a target domain, is an effective way to alleviate the lack of training data. However, most existing transfer learning methods focus on transferring knowledge from a single-modal (e.g., image) source domain to a cross-modal (e.g., image and text) target domain; if the source domain already contains multiple modalities, such asymmetric transfer ignores the potential inter-modal semantic information in the source domain. Moreover, these methods cannot well exploit the similarity between the same modality in the source and target domains to reduce the domain discrepancy. Therefore, a Deep Bi-modal source domain Symmetrical Transfer Learning for cross-modal retrieval (DBSTL) method was proposed, which aims to realize knowledge transfer from a bi-modal source domain to a cross-modal target domain and to obtain a common representation of cross-modal data. DBSTL consists of a modal symmetric transfer subnet and a semantic consistency learning subnet. The modal symmetric transfer subnet adopts a hybrid symmetric structure that makes the information of different modalities more consistent during knowledge transfer and reduces the difference between the source and target domains. In the semantic consistency learning subnet, all modalities share the same common representation layer, and cross-modal semantic consistency is ensured under the guidance of the supervision information of the target domain. Experimental results show that on the Pascal, NUS-WIDE-10k and Wikipedia datasets, the mean Average Precision (mAP) of the proposed method is improved by about 8.4, 0.4 and 1.2 percentage points, respectively, compared with the best results obtained by the comparison methods. DBSTL makes full use of the potential information of the bi-modal source domain to conduct symmetric transfer learning, ensures inter-modal semantic consistency under the guidance of supervision information, and improves the similarity of the image and text distributions in the common representation space.

Key words: cross-modal retrieval, transfer learning, bi-modal source domain, semantic consistency
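
This page carries only the abstract, which describes the DBSTL architecture at a high level. As a rough illustration of how such a design could be wired up, the PyTorch sketch below pairs two symmetric modality branches with a shared common-representation layer and a supervised classifier that enforces cross-modal semantic consistency, while a simple moment-matching term stands in for the symmetric transfer subnet's source/target alignment. Every module name, layer size, and the linear moment-matching stand-in here are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of a bi-modal, symmetric two-subnet design as described in
# the abstract. NOT the paper's implementation: all names, dimensions, and
# the first-moment alignment term below are illustrative assumptions.
import torch
import torch.nn as nn

class ModalityBranch(nn.Module):
    """One branch of the symmetric transfer subnet (image or text side)."""
    def __init__(self, in_dim, hidden_dim=1024, common_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, common_dim),
        )

    def forward(self, x):
        return self.net(x)

class DBSTLSketch(nn.Module):
    """Hypothetical layout: symmetric branches feeding one shared layer."""
    def __init__(self, img_dim, txt_dim, num_classes, common_dim=256):
        super().__init__()
        self.img_branch = ModalityBranch(img_dim, common_dim=common_dim)
        self.txt_branch = ModalityBranch(txt_dim, common_dim=common_dim)
        # Both modalities pass through the SAME common-representation layer,
        # mirroring "all modalities share the same common representation layer".
        self.shared = nn.Sequential(nn.Linear(common_dim, common_dim), nn.ReLU())
        self.classifier = nn.Linear(common_dim, num_classes)

    def forward(self, img, txt):
        z_img = self.shared(self.img_branch(img))
        z_txt = self.shared(self.txt_branch(txt))
        return z_img, z_txt, self.classifier(z_img), self.classifier(z_txt)

def semantic_consistency_loss(logits_img, logits_txt, labels):
    # Target-domain supervision: both modalities must predict the same labels.
    ce = nn.CrossEntropyLoss()
    return ce(logits_img, labels) + ce(logits_txt, labels)

def domain_alignment_loss(feat_a, feat_b):
    # Crude first-moment matching as a stand-in for a domain-discrepancy term.
    return (feat_a.mean(dim=0) - feat_b.mean(dim=0)).pow(2).sum()

# Toy usage: 4096-d image features, 300-d text features, 10 classes.
model = DBSTLSketch(img_dim=4096, txt_dim=300, num_classes=10)
img, txt = torch.randn(8, 4096), torch.randn(8, 300)
labels = torch.randint(0, 10, (8,))
z_img, z_txt, logits_img, logits_txt = model(img, txt)
loss = semantic_consistency_loss(logits_img, logits_txt, labels) \
       + domain_alignment_loss(z_img, z_txt)
```

In a full training loop, the bi-modal source-domain pairs and the labelled target-domain pairs would both pass through the symmetric branches, with the alignment and consistency terms weighted against each other; the weighting scheme is not specified by the abstract.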
