Technological evolution of multimodal retrieval

doi:10.11772/j.issn.1001-9081.2026010028

Abstract

The core task of multimodal retrieval is to bridge the semantic gap between data of different modalities, thereby enabling cross-modal similarity matching and content search. Its technical evolution has broadly passed through three stages: shallow alignment, deep representation, and large-model-driven approaches. Early work, constrained by limited computing power and data scale, relied mainly on shallow models such as common subspace learning and hashing methods, whose core idea is to find a shared representation space in which features from different modalities can be aligned. Although intuitive and efficient, these methods had limited capacity to represent complex semantics. The introduction of deep learning brought fundamental changes: through deep neural networks and attention mechanisms, models can automatically learn complex nonlinear mappings between modalities, which significantly improves the accuracy of semantic matching. However, it also introduced new problems such as model bloat and strong data dependency. More recently, the field has entered an era driven by large models. Researchers have begun constructing unified multimodal semantic spaces with vision-language pretrained models. While these models demonstrate powerful zero-shot retrieval abilities, they still face practical challenges, particularly in adapting to fine-grained, domain-specific tasks and in balancing model scale against computational efficiency. By summarizing the technological evolution across these three stages, this study analyzes the core technologies and evolutionary logic of each stage, explores their applicable scenarios, and discusses the representative datasets and evaluation systems established at each stage, which provide a standardized platform for method comparison. On this basis, it outlines directions for future research, including balancing efficiency and capability, advancing deep reasoning abilities, and adapting to specialized domains.


CLC Number: TP391

YU Xiaopu (于晓璞), GUO Jie (郭洁). Technological evolution of multimodal retrieval [J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2026010028.

