Journal of Computer Applications


Technological evolution of multimodal retrieval

  

  • Received: 2026-01-19 Revised: 2026-04-24 Online: 2026-05-29 Published: 2026-05-29

多模态检索的技术演进

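Yu Xiaopu (于晓璞)1, Guo Jie (郭洁)2

  1. Shanghai Polytechnic University (上海第二工业大学)
  2. Xidian University (西安电子科技大学)
  • Corresponding author: Yu Xiaopu (于晓璞)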

Abstract: The core task of multimodal retrieval is to bridge the semantic gap between data of different modalities, enabling cross-modal similarity matching and content search. Its technical evolution has broadly passed through three stages: shallow alignment, deep representation, and large-model-driven approaches. Early work, constrained by limited computational power and data scale, relied mainly on shallow models such as common subspace learning and hashing methods, whose core idea is to find a shared representation space in which features from different modalities can be aligned. Although intuitive and efficient, these methods had limited capacity to represent complex semantics. The introduction of deep learning brought a fundamental change: with deep neural networks and attention mechanisms, models can automatically learn complex nonlinear mappings between modalities, which significantly improves the accuracy of semantic matching. However, it also introduced new problems such as bloated models and strong data dependency. More recently, the field has entered an era driven by large models, in which researchers build unified multimodal semantic spaces using vision-language pretrained models. These models demonstrate strong zero-shot retrieval ability, but adapting them to fine-grained, domain-specific tasks and balancing model scale against computational efficiency remain practical challenges. By summarizing the technical evolution across these three stages, this paper analyzes the core technologies of each stage and the logic of their evolution, explores their applicable scenarios, and discusses the representative datasets and evaluation systems established at each stage, which provide a standardized platform for comparing methods. On this basis, it outlines directions for future research, including balancing efficiency and capability, achieving breakthroughs in deep reasoning, and adapting to specialized domains.

摘要: 多模态检索的核心任务是解决不同模态数据之间的语义鸿沟问题,从而实现跨模态的相似性匹配与内容搜索。其技术演进大致经历了从浅层对齐、深度表示到大模型驱动的三个阶段。早期工作受限于计算能力和数据规模,主要依赖公共子空间学习和哈希方法等浅层模型,核心思路是为不同模态特征寻找一个共享的表示空间以实现对齐。这类方法虽然直观高效,但是表征复杂语义的能力有限。而深度学习的引入带来了根本性的改变,通过深度神经网络与注意力机制,模型能够自动学习模态间复杂的非线性映射关系,显著提升语义匹配的精度,但是也带来了模型臃肿、数据依赖性强等新问题。近年来进入了大模型驱动的时代。以视觉-语言预训练模型为代表,研究者们开始尝试构建统一的多模态语义空间。这种方法展现出强大的零样本检索能力,但如何使其适应特定领域的精细任务,以及如何平衡模型规模与计算效率,仍是当前实践中遇到的现实挑战。通过归纳总结这三个阶段的技术演进,本文分析了各阶段的核心技术及其演进逻辑,探讨了其适用场景,并讨论了各阶段均形成的代表性的数据集与评估体系,为方法比较提供了标准化平台。在此基础上,本文展望了未来的研究方向,包括效率与能力的平衡、深度推理能力的突破,以及专业领域的适配。
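
As a rough illustration of the shared-space idea that the abstract attributes to subspace and hashing methods, the sketch below ranks image vectors against a text query by cosine similarity, scores the ranking with Recall@K, and compares sign-based binary codes by Hamming distance. It is a minimal sketch and not a method from the paper: the function names, the toy 3-dimensional embeddings, and the sign-threshold hash are all assumptions made for illustration, and the embeddings are taken as already projected into a common space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_cosine(query, gallery):
    """Gallery indices ordered from most to least similar to the query."""
    return sorted(range(len(gallery)),
                  key=lambda i: cosine(query, gallery[i]),
                  reverse=True)

def recall_at_k(queries, gallery, ground_truth, k):
    """Fraction of queries whose true match appears among the top k results."""
    hits = sum(1 for q, gt in zip(queries, ground_truth)
               if gt in rank_by_cosine(q, gallery)[:k])
    return hits / len(queries)

def sign_hash(vec):
    """Binarize by sign: a crude stand-in for a learned hash function."""
    return tuple(1 if x >= 0 else 0 for x in vec)

def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical toy data: text queries and an image gallery that are assumed
# to be embedded in the same 3-dimensional space already.
text_queries = [[0.9, 0.1, -0.2], [-0.1, 0.8, 0.3]]
image_gallery = [[1.0, 0.0, -0.1], [0.0, 1.0, 0.2], [-0.7, -0.6, 0.9]]
ground_truth = [0, 1]  # index of the matching image for each query

ranking = rank_by_cosine(text_queries[0], image_gallery)
recall_1 = recall_at_k(text_queries, image_gallery, ground_truth, k=1)
code_distance = hamming(sign_hash(text_queries[0]), sign_hash(image_gallery[2]))
```

Real-valued similarity gives a finer ranking, while the binary codes trade precision for cheap storage and fast comparison, which is the efficiency argument usually made for hashing-based retrieval.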
