Cross-modal information fusion for video-text retrieval
Yimeng XI, Zhen DENG, Qian LIU, Libo LIU
Journal of Computer Applications 2025, 45 (8): 2448-2456. DOI: 10.11772/j.issn.1001-9081.2024081082
Abstract

Existing Video-Text Retrieval (VTR) methods usually assume a strong semantic association between text descriptions and videos while ignoring the weakly related video-text pairs that are widespread in datasets. As a result, the models recognize common, general concepts well but cannot fully mine the latent information in weak semantic descriptions, which degrades retrieval performance. To address this problem, a VTR model based on cross-modal information fusion was proposed, in which relevant external knowledge was exploited across modalities to improve retrieval performance. Firstly, two external knowledge retrieval modules were constructed to retrieve external knowledge for videos and for texts respectively, so that the original video and text feature representations could subsequently be strengthened with the retrieved knowledge. Secondly, a cross-modal information fusion module with adaptive cross-attention was designed to remove redundant information in videos and texts and to fuse features by exploiting the complementary information between modalities, thereby learning more discriminative feature representations. Finally, inter-modal and intra-modal similarity loss functions were introduced to preserve the integrity of data representations in the fusion, video, and text feature spaces, enabling accurate retrieval across modalities. Experimental results show that compared with the MuLTI model, the proposed model improves recall R@1 on the public datasets MSR-VTT (Microsoft Research Video to Text) and DiDeMo (Distinct Describable Moments) by 2.0 and 1.9 percentage points respectively, and compared with the CLIP-ViP model, it improves R@1 on the public dataset LSMDC (Large Scale Movie Description Challenge) by 2.9 percentage points. These results demonstrate that the proposed model effectively alleviates the problem of weakly related data pairs in VTR tasks and improves retrieval accuracy.
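To make the abstract's fusion and loss design more concrete, the following is a minimal PyTorch-style sketch, not the authors' released code: it assumes video and text tokens already encoded to a shared dimension, uses a gated cross-attention block as one plausible reading of "adaptive cross-attention", and pairs a symmetric InfoNCE inter-modal loss with a simple intra-modal structure-alignment term. Class and function names (AdaptiveCrossAttentionFusion, retrieval_losses), the gating scheme, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of the fusion module and similarity losses described in the
# abstract; module names, the gating mechanism, and the intra-modal loss form are
# assumptions, not the paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveCrossAttentionFusion(nn.Module):
    """Cross-attention between video and text tokens; a learned sigmoid gate
    decides per token how much cross-modal context to mix in ("adaptive")."""
    def __init__(self, d: int = 512, n_heads: int = 8):
        super().__init__()
        self.t2v = nn.MultiheadAttention(d, n_heads, batch_first=True)  # text -> video
        self.v2t = nn.MultiheadAttention(d, n_heads, batch_first=True)  # video -> text
        self.gate_v = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())
        self.gate_t = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())

    def forward(self, video: torch.Tensor, text: torch.Tensor):
        # video: (B, Nv, d) frame/clip tokens; text: (B, Nt, d) word tokens
        v_ctx, _ = self.t2v(video, text, text)    # text context for each video token
        t_ctx, _ = self.v2t(text, video, video)   # video context for each text token
        gv = self.gate_v(torch.cat([video, v_ctx], dim=-1))  # per-token fusion weight
        gt = self.gate_t(torch.cat([text, t_ctx], dim=-1))
        v_fused = gv * v_ctx + (1 - gv) * video
        t_fused = gt * t_ctx + (1 - gt) * text
        # Mean-pool tokens into single retrieval vectors
        return v_fused.mean(dim=1), t_fused.mean(dim=1)

def retrieval_losses(v: torch.Tensor, t: torch.Tensor, tau: float = 0.05):
    """Inter-modal: symmetric InfoNCE over matched video-text pairs in a batch.
    Intra-modal: keep the two modalities' in-batch similarity structures
    consistent (one simple interpretation of the intra-modal similarity loss)."""
    v = F.normalize(v, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = v @ t.T / tau                         # (B, B) cross-modal similarities
    labels = torch.arange(v.size(0), device=v.device)
    inter = 0.5 * (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.T, labels))
    intra = F.mse_loss(v @ v.T, t @ t.T)           # align intra-modal structures
    return inter + intra

# Toy usage with random features
fusion = AdaptiveCrossAttentionFusion(d=512)
video_feats = torch.randn(4, 12, 512)   # 4 videos, 12 frame tokens each
text_feats = torch.randn(4, 20, 512)    # 4 captions, 20 word tokens each
v_vec, t_vec = fusion(video_feats, text_feats)
loss = retrieval_losses(v_vec, t_vec)
```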
