Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (8): 2448-2456. DOI: 10.11772/j.issn.1001-9081.2024081082

• The 21st CCF Conference on Web Information Systems and Applications (WISA 2024) •

Cross-modal information fusion for video-text retrieval

XI Yimeng1, DENG Zhen1, LIU Qian1, LIU Libo1,2

  1. School of Information Engineering, Ningxia University, Yinchuan 750021, China
    2. Ningxia Key Laboratory of Artificial Intelligence and Information Security for Channeling Computing Resources from the East to the West (Ningxia University), Yinchuan 750021, China
  • Received: 2024-08-02; Revised: 2024-08-19; Accepted: 2024-08-21; Online: 2024-09-12; Published: 2025-08-10
  • Corresponding author: LIU Libo
  • About authors: XI Yimeng, born in 2000, female, native of Weinan, Shaanxi, M. S. candidate, CCF member. Her research interests include cross-modal video-text retrieval.
    DENG Zhen, born in 1984, female, native of Sanmenxia, Henan, Ph. D., associate professor. Her research interests include image processing and machine vision.
    LIU Qian, born in 1981, female (Manchu), native of Xinzhou, Shanxi, M. S., associate professor. Her research interests include graphic and image processing.
  • Supported by:
    National Natural Science Foundation of China (62262053); Ningxia Science and Technology Innovation Leading Talent Project (2022GKLRLX03); Ningxia University Graduate Innovation Project (CXXM202406); Scientific Research Project of Ningxia Higher Education Institutions (NYG2024023)

Cross-modal information fusion for video-text retrieval

Yimeng XI1, Zhen DENG1, Qian LIU1, Libo LIU1,2

  1. School of Information Engineering, Ningxia University, Yinchuan, Ningxia 750021, China
    2. Ningxia Key Laboratory of Artificial Intelligence and Information Security for Channeling Computing Resources from the East to the West (Ningxia University), Yinchuan, Ningxia 750021, China
  • Received: 2024-08-02; Revised: 2024-08-19; Accepted: 2024-08-21; Online: 2024-09-12; Published: 2025-08-10
  • Contact: Libo LIU
  • About authors: XI Yimeng, born in 2000, M. S. candidate, CCF member. Her research interests include cross-modal retrieval of video and text.
    DENG Zhen, born in 1984, Ph. D., associate professor. Her research interests include image processing and machine vision.
    LIU Qian, born in 1981, M. S., associate professor. Her research interests include graphic and image processing.
  • Supported by:
    National Natural Science Foundation of China (62262053); Ningxia Science and Technology Innovation Leading Talent Project (2022GKLRLX03); Ningxia University Graduate Innovation Project (CXXM202406); Scientific Research Project of Ningxia Higher Education Institutions (NYG2024023)

Abstract:

Existing video-text retrieval (VTR) methods usually assume a strong semantic association between a text description and its video, while ignoring the weakly related video-text pairs that are widespread in datasets; as a result, such models are good at recognizing common general concepts but cannot fully mine the latent information in weak semantic descriptions, which hurts retrieval performance. To address this problem, a cross-modal information fusion VTR model was proposed that exploits relevant external knowledge in a cross-modal way to improve retrieval performance. First, two external knowledge retrieval modules were built, one retrieving external knowledge for videos and the other for texts, so that the original video and text feature representations could subsequently be strengthened with external knowledge. Second, a cross-modal information fusion module with adaptive cross-attention was designed to remove redundant information from the videos and texts and to fuse features using the complementary information between modalities, thereby learning more discriminative feature representations. Finally, inter-modal and intra-modal similarity loss functions were introduced to guarantee the completeness of the information representation of the data in the fused feature space, the video feature space, and the text feature space, enabling accurate cross-modal retrieval. Experimental results show that, compared with the MuLTI model, the proposed model improves the recall R@1 on the public datasets MSR-VTT (Microsoft Research Video to Text) and DiDeMo (Distinct Describable Moments) by 2.0 and 1.9 percentage points, respectively, and that, compared with the CLIP-ViP model, it improves R@1 on the public dataset LSMDC (Large Scale Movie Description Challenge) by 2.9 percentage points. Therefore, the proposed model can effectively solve the weakly related data problem in VTR tasks and improve retrieval accuracy.
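As a concrete illustration of the adaptive cross-attention fusion idea described above, the following is a minimal PyTorch-style sketch. It is not the authors' implementation: the module and parameter names (AdaptiveCrossAttentionFusion, the token-level sigmoid gates, mean pooling, the 512-dimensional features) are illustrative assumptions only.

```python
# A minimal sketch (not the authors' code) of an adaptive cross-attention
# fusion block: each modality attends to the other, and a learned per-token
# gate decides how much cross-modal context to mix in, which is one plausible
# way to suppress redundant information while exploiting complementary cues.
import torch
import torch.nn as nn

class AdaptiveCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.v2t_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.t2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gates (one scalar per token) adaptively weight the attended features.
        self.video_gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.text_gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, Nv, D) frame features; text_feats: (B, Nt, D) token features
        v_ctx, _ = self.v2t_attn(video_feats, text_feats, text_feats)   # video attends to text
        t_ctx, _ = self.t2v_attn(text_feats, video_feats, video_feats)  # text attends to video
        g_v = self.video_gate(torch.cat([video_feats, v_ctx], dim=-1))  # (B, Nv, 1)
        g_t = self.text_gate(torch.cat([text_feats, t_ctx], dim=-1))    # (B, Nt, 1)
        v_fused = video_feats + g_v * v_ctx
        t_fused = text_feats + g_t * t_ctx
        # Pool each modality and project the pair into a joint fused representation.
        joint = self.fuse(torch.cat([v_fused.mean(dim=1), t_fused.mean(dim=1)], dim=-1))
        return joint, v_fused.mean(dim=1), t_fused.mean(dim=1)
```

The gating step is where the "adaptive" behaviour enters: tokens whose cross-modal context adds little information receive gate values near zero and remain close to their original features.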

Key words: cross-modal retrieval, video-text retrieval, multi-feature fusion, weak semantic data, adaptive

Abstract:

Existing Video-Text Retrieval (VTR) methods usually assume a strong semantic association between text descriptions and videos, while ignoring the weakly related video-text pairs that are widespread in datasets. As a result, the models are good at recognizing common general concepts but cannot fully mine the latent information in weak semantic descriptions, which degrades retrieval performance. To address this problem, a VTR model based on cross-modal information fusion was proposed, in which relevant external knowledge was exploited in a cross-modal way to improve retrieval performance. Firstly, two external knowledge retrieval modules were constructed to retrieve external knowledge for videos and for texts respectively, so that the original video and text feature representations could subsequently be strengthened with the help of external knowledge. Secondly, a cross-modal information fusion module with adaptive cross-attention was designed to remove redundant information from the videos and texts and to fuse features using the complementary information between different modalities, thereby learning more discriminative feature representations. Finally, inter-modal and intra-modal similarity loss functions were introduced to ensure the integrity of the information representation of the data in the fused feature space, the video feature space, and the text feature space, so as to achieve accurate retrieval across modalities. Experimental results show that, compared with the MuLTI model, the proposed model improves the recall R@1 by 2.0 and 1.9 percentage points on the public datasets MSR-VTT (Microsoft Research Video to Text) and DiDeMo (Distinct Describable Moments), respectively, and that, compared with the CLIP-ViP model, it improves R@1 by 2.9 percentage points on the public dataset LSMDC (Large Scale Movie Description Challenge). These results demonstrate that the proposed model effectively addresses the weakly related data problem in VTR tasks and thus improves retrieval accuracy.
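To make the loss design described above more concrete, here is a small sketch of how an inter-modal contrastive term can be combined with intra-modal terms that tie a fused feature space to the video and text spaces. The function names (info_nce, retrieval_loss), the InfoNCE form, the temperature, and the weighting w_intra are assumptions for illustration, not the paper's exact formulation.

```python
# A minimal sketch (assumptions, not the paper's exact losses) combining an
# inter-modal contrastive loss with intra-modal similarity losses so that the
# fused, video, and text feature spaces all remain informative.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature: float = 0.07):
    """Symmetric InfoNCE between two batches of embeddings a, b of shape (B, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def retrieval_loss(video_emb, text_emb, fused_emb, w_intra: float = 0.5):
    # Inter-modal term: matched video/text pairs should be close across modalities.
    inter = info_nce(video_emb, text_emb)
    # Intra-modal terms: the fused representation should stay consistent with
    # both the video space and the text space (illustrative weighting w_intra).
    intra = info_nce(fused_emb, video_emb) + info_nce(fused_emb, text_emb)
    return inter + w_intra * intra
```

In this sketch the diagonal of the similarity matrix corresponds to the ground-truth video-text pairs, so minimizing the loss pulls matched pairs together and pushes mismatched pairs apart in all three feature spaces.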

Key words: cross-modal retrieval, Video-Text Retrieval (VTR), multi-feature fusion, weak semantic data, adaptive

CLC number: