Journal of Computer Applications
习怡萌,邓箴,刘倩,刘立波
Abstract: Video-text retrieval, a fundamental task in vision-language learning, aims to retrieve the video that shares the same semantic content as a given text description, or the text description that matches a given video. Accurately mining the latent semantic correspondence between video and text is the key difficulty of this task. Existing video-text retrieval methods usually assume a strong semantic association between a text description and its video, ignoring the video-text pairs with weak semantic descriptions that are widespread in datasets. As a result, such models are good at recognizing common, general concepts but cannot fully exploit the latent information in weakly described pairs, and they overlook the fine-grained interactions hidden between the two modalities, which degrades retrieval performance. To address these problems, this paper proposes a video-text retrieval model based on cross-modal information fusion, which exploits relevant external knowledge in a cross-modal manner to improve retrieval performance. First, two external knowledge retrieval modules are constructed, one for video-to-knowledge retrieval and one for text-to-knowledge retrieval, so that external knowledge can strengthen the original video and text feature representations. Second, an adaptive cross-attention fusion module is designed to remove redundant information from the video and text features and to fuse the complementary information between the modalities, yielding more discriminative feature representations. Finally, inter-modal and intra-modal similarity losses are introduced to keep the information representation consistent across the fusion, video, and text feature spaces and to achieve accurate retrieval across modalities. Experimental results on three public datasets, MSR-VTT, DiDeMo, and LSMDC, show that the proposed method outperforms existing video-text retrieval methods.
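The abstract names two components without giving their formulation: an adaptive cross-attention module that fuses video and text features, and inter-/intra-modal similarity losses defined over the fusion, video, and text spaces. The Python sketch below is only an illustration of how such components are commonly built (gated cross-attention plus a symmetric InfoNCE-style loss); the module structure, dimensions, gating scheme, and loss form are assumptions made for illustration, not the authors' implementation.

# Minimal illustrative sketch (PyTorch); all names and hyperparameters are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveCrossModalFusion(nn.Module):
    """Fuse video and text token features with cross-attention and an adaptive gate."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)  # text queries attend to video
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # video queries attend to text
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # video: (B, Nv, D) frame features; text: (B, Nt, D) token features
        text_ctx, _ = self.v2t(text, video, video)    # video-conditioned text tokens
        video_ctx, _ = self.t2v(video, text, text)    # text-conditioned video frames
        t = text_ctx.mean(dim=1)                      # pool to (B, D)
        v = video_ctx.mean(dim=1)
        g = self.gate(torch.cat([v, t], dim=-1))      # adaptive weighting of the two modalities
        return g * v + (1.0 - g) * t                  # fused joint representation

def symmetric_info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Symmetric contrastive loss over matched pairs (a_i, b_i); used here as a
    stand-in for the inter-/intra-modal similarity losses named in the abstract."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

if __name__ == "__main__":
    B, Nv, Nt, D = 4, 12, 20, 512
    video_feats, text_feats = torch.randn(B, Nv, D), torch.randn(B, Nt, D)
    fusion = AdaptiveCrossModalFusion(dim=D)
    fused = fusion(video_feats, text_feats)                            # (B, D)
    loss = (symmetric_info_nce(fused, video_feats.mean(1)) +           # fusion <-> video space
            symmetric_info_nce(fused, text_feats.mean(1)) +            # fusion <-> text space
            symmetric_info_nce(video_feats.mean(1), text_feats.mean(1)))  # video <-> text
    print(fused.shape, float(loss))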
Key words: Cross-modal retrieval, Video-text retrieval, Multi-feature fusion, Weak semantic data, Adaptive
CLC Number: TP391.3
习怡萌, 邓箴, 刘倩, 刘立波. Video-text retrieval based on cross-modal information fusion [J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2024081082.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024081082