Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (8): 2448-2456. DOI: 10.11772/j.issn.1001-9081.2024081082

• The 21st CCF Conference on Web Information Systems and Applications (WISA 2024) •

Cross-modal information fusion for video-text retrieval

XI Yimeng1, DENG Zhen1, LIU Qian1, LIU Libo1,2

  1. School of Information Engineering, Ningxia University, Yinchuan 750021, China
    2. Ningxia Key Laboratory of Artificial Intelligence and Information Security for Channeling Computing Resources from the East to the West (Ningxia University), Yinchuan 750021, China
  • Received: 2024-08-02; Revised: 2024-08-19; Accepted: 2024-08-21; Online: 2024-09-12; Published: 2025-08-10
  • Corresponding author: LIU Libo
  • About authors: XI Yimeng, born in 2000, female, native of Weinan, Shaanxi, M. S. candidate, CCF member. Her research interests include cross-modal video-text retrieval.
    DENG Zhen, born in 1984, female, native of Sanmenxia, Henan, Ph. D., associate professor. Her research interests include image processing and machine vision.
    LIU Qian, born in 1981, female (Manchu), native of Xinzhou, Shanxi, M. S., associate professor. Her research interests include graphic and image processing.
  • Supported by:
    National Natural Science Foundation of China (62262053); Ningxia Science and Technology Innovation Leading Talent Project (2022GKLRLX03); Ningxia University Graduate Innovation Project (CXXM202406); Scientific Research Project of Ningxia Higher Education Institutions (NYG2024023)

Cross-modal information fusion for video-text retrieval

Yimeng XI1, Zhen DENG1, Qian LIU1, Libo LIU1,2

  1. School of Information Engineering, Ningxia University, Yinchuan, Ningxia 750021, China
    2. Ningxia Key Laboratory of Artificial Intelligence and Information Security for Channeling Computing Resources from the East to the West (Ningxia University), Yinchuan, Ningxia 750021, China
  • Received: 2024-08-02; Revised: 2024-08-19; Accepted: 2024-08-21; Online: 2024-09-12; Published: 2025-08-10
  • Contact: Libo LIU
  • About authors: XI Yimeng, born in 2000, M. S. candidate, CCF member. Her research interests include cross-modal retrieval of video and text.
    DENG Zhen, born in 1984, Ph. D., associate professor. Her research interests include image processing and machine vision.
    LIU Qian, born in 1981, M. S., associate professor. Her research interests include graphic and image processing.
  • Supported by:
    National Natural Science Foundation of China (62262053); Ningxia Science and Technology Innovation Leading Talent Project (2022GKLRLX03); Ningxia University Graduate Innovation Project (CXXM202406); Scientific Research Project of Ningxia Higher Education Institutions (NYG2024023)

Abstract:

Existing video-text retrieval (VTR) methods usually assume a strong semantic association between a text description and its video, while ignoring the weakly related video-text pairs that are widespread in datasets; as a result, such models are good at recognizing common general concepts but cannot fully mine the latent information in weak semantic descriptions, which hurts retrieval performance. To address this problem, a cross-modal information fusion VTR model was proposed that exploits relevant external knowledge in a cross-modal way to improve retrieval performance. First, two external knowledge retrieval modules were built, one retrieving external knowledge for videos and the other for texts, so that the original video and text feature representations could subsequently be strengthened with external knowledge. Second, a cross-modal information fusion module with adaptive cross-attention was designed to remove redundant information from the videos and texts and to fuse features using the complementary information between modalities, thereby learning more discriminative feature representations. Finally, inter-modal and intra-modal similarity loss functions were introduced to guarantee the completeness of the information representation of the data in the fused feature space, the video feature space, and the text feature space, enabling accurate cross-modal retrieval. Experimental results show that, compared with the MuLTI model, the proposed model improves the recall R@1 on the public datasets MSR-VTT (Microsoft Research Video to Text) and DiDeMo (Distinct Describable Moments) by 2.0 and 1.9 percentage points, respectively, and that, compared with the CLIP-ViP model, it improves R@1 on the public dataset LSMDC (Large Scale Movie Description Challenge) by 2.9 percentage points. Therefore, the proposed model can effectively solve the weakly related data problem in VTR tasks and improve retrieval accuracy.
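As a concrete illustration of the adaptive cross-attention fusion idea described above, the following is a minimal PyTorch-style sketch. It is not the authors' implementation: the module and parameter names (AdaptiveCrossAttentionFusion, the token-level sigmoid gates, mean pooling, the 512-dimensional features) are illustrative assumptions only.

```python
# A minimal sketch (not the authors' code) of an adaptive cross-attention
# fusion block: each modality attends to the other, and a learned per-token
# gate decides how much cross-modal context to mix in, which is one plausible
# way to suppress redundant information while exploiting complementary cues.
import torch
import torch.nn as nn

class AdaptiveCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.v2t_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.t2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gates (one scalar per token) adaptively weight the attended features.
        self.video_gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.text_gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, Nv, D) frame features; text_feats: (B, Nt, D) token features
        v_ctx, _ = self.v2t_attn(video_feats, text_feats, text_feats)   # video attends to text
        t_ctx, _ = self.t2v_attn(text_feats, video_feats, video_feats)  # text attends to video
        g_v = self.video_gate(torch.cat([video_feats, v_ctx], dim=-1))  # (B, Nv, 1)
        g_t = self.text_gate(torch.cat([text_feats, t_ctx], dim=-1))    # (B, Nt, 1)
        v_fused = video_feats + g_v * v_ctx
        t_fused = text_feats + g_t * t_ctx
        # Pool each modality and project the pair into a joint fused representation.
        joint = self.fuse(torch.cat([v_fused.mean(dim=1), t_fused.mean(dim=1)], dim=-1))
        return joint, v_fused.mean(dim=1), t_fused.mean(dim=1)
```

The gating step is where the "adaptive" behaviour enters: tokens whose cross-modal context adds little information receive gate values near zero and remain close to their original features.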

Key words: cross-modal retrieval, video-text retrieval, multi-feature fusion, weak semantic data, adaptive

Abstract:

Existing Video-Text Retrieval (VTR) methods usually assume a strong semantic association between text descriptions and videos, while ignoring the weakly related video-text pairs that are widespread in datasets. As a result, the models are good at recognizing common general concepts but cannot fully mine the latent information in weak semantic descriptions, which degrades retrieval performance. To address this problem, a VTR model based on cross-modal information fusion was proposed, in which relevant external knowledge was exploited in a cross-modal way to improve retrieval performance. Firstly, two external knowledge retrieval modules were constructed to retrieve external knowledge for videos and for texts respectively, so that the original video and text feature representations could subsequently be strengthened with the help of external knowledge. Secondly, a cross-modal information fusion module with adaptive cross-attention was designed to remove redundant information from the videos and texts and to fuse features using the complementary information between different modalities, thereby learning more discriminative feature representations. Finally, inter-modal and intra-modal similarity loss functions were introduced to ensure the integrity of the information representation of the data in the fused feature space, the video feature space, and the text feature space, so as to achieve accurate retrieval across modalities. Experimental results show that, compared with the MuLTI model, the proposed model improves the recall R@1 by 2.0 and 1.9 percentage points on the public datasets MSR-VTT (Microsoft Research Video to Text) and DiDeMo (Distinct Describable Moments), respectively, and that, compared with the CLIP-ViP model, it improves R@1 by 2.9 percentage points on the public dataset LSMDC (Large Scale Movie Description Challenge). These results demonstrate that the proposed model effectively addresses the weakly related data problem in VTR tasks and thus improves retrieval accuracy.
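To make the loss design described above more concrete, here is a small sketch of how an inter-modal contrastive term can be combined with intra-modal terms that tie a fused feature space to the video and text spaces. The function names (info_nce, retrieval_loss), the InfoNCE form, the temperature, and the weighting w_intra are assumptions for illustration, not the paper's exact formulation.

```python
# A minimal sketch (assumptions, not the paper's exact losses) combining an
# inter-modal contrastive loss with intra-modal similarity losses so that the
# fused, video, and text feature spaces all remain informative.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature: float = 0.07):
    """Symmetric InfoNCE between two batches of embeddings a, b of shape (B, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def retrieval_loss(video_emb, text_emb, fused_emb, w_intra: float = 0.5):
    # Inter-modal term: matched video/text pairs should be close across modalities.
    inter = info_nce(video_emb, text_emb)
    # Intra-modal terms: the fused representation should stay consistent with
    # both the video space and the text space (illustrative weighting w_intra).
    intra = info_nce(fused_emb, video_emb) + info_nce(fused_emb, text_emb)
    return inter + w_intra * intra
```

In this sketch the diagonal of the similarity matrix corresponds to the ground-truth video-text pairs, so minimizing the loss pulls matched pairs together and pushes mismatched pairs apart in all three feature spaces.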

Key words: cross-modal retrieval, Video-Text Retrieval (VTR), multi-feature fusion, weak semantic data, adaptive

CLC number: