Journal of Computer Applications, 2026, Vol. 46, Issue (6): 1703-1711. DOI: 10.11772/j.issn.1001-9081.2025060695

• Artificial intelligence •    

Review of vision-language model architecture development

Ziquan LIU, Xuyang SHI, Ke LI, Liang LIU, Zhewei ZHU

  1. School of Information and Control Engineering, Southwest University of Science and Technology, Mianyang Sichuan 621010, China
  • Received: 2025-06-20 Revised: 2025-08-15 Accepted: 2025-08-21 Online: 2025-09-01 Published: 2026-06-10
  • Contact: Xuyang SHI
  • About author: LIU Ziquan, born in 2002, M. S. candidate. His research interests include pattern recognition, image segmentation, and multimodality.
    LI Ke, born in 1994, Ph. D., lecturer. Her research interests include medical image processing and intelligent optical signal processing.
    LIU Liang, born in 1994, Ph. D., lecturer. His research interests include millimeter-wave antennas, metasurfaces, and reflectarray antennas.
    ZHU Zhewei, born in 2001, M. S. candidate. Her research interests include pattern recognition, medical image processing, and real-time object detection.
    Primary contact: SHI Xuyang, born in 1989, Ph. D., professor. His research interests include biosensors and intelligent detection, machine learning, and medical image processing.
  • Supported by:
    Sichuan Science and Technology Program (2024NSFSC2040); Doctoral Fund Project of Southwest University of Science and Technology (23zx7136)

视觉语言模型架构发展综述

刘紫权, 史旭阳, 李珂, 刘良, 朱哲维

  1. 西南科技大学 信息与控制工程学院,四川 绵阳 621010
  • 通讯作者: 史旭阳
  • 作者简介:刘紫权(2002—),男,四川内江人,硕士研究生,主要研究方向:模式识别、图像分割、多模态
    李珂(1994—),女,贵州兴义人,讲师,博士,主要研究方向:医学图像处理、智能光学信号处理
    刘良(1994—),男,山东泰安人,讲师,博士,主要研究方向:毫米波天线,超表面、反射阵天线
    朱哲维(2001—),女,贵州毕节人,硕士研究生,主要研究方向:模式识别、医学图像处理、实时目标检测。
    第一联系人:史旭阳(1989—),男,陕西渭南人,教授,博士,CCF会员,主要研究方向:生物传感与智能检测、机器学习、医学图像处理
  • 基金资助:
    四川省科技计划项目(2024NSFSC2040);西南科技大学博士基金资助项目(23zx7136)

Abstract:

Advances in deep learning are driving artificial intelligence from single-modality intelligence toward multimodal intelligence. Vision-Language Models (VLMs), a pivotal means of bridging vision and language, have become a core research area. This paper systematically reviews the architectural development of VLMs and summarizes the core technologies and latest research progress in the field. Firstly, the progression of VLMs from early explorations to their current flourishing state is traced, key technological milestones and development trends are analyzed, and a technology roadmap with “architecture development” as its core theme is outlined. Secondly, the foundational techniques of current VLMs are analyzed in depth, including core architectures built around vision encoders, language encoders, and cross-modal fusion mechanisms, as well as key pretraining objectives such as Masked Language Modeling (MLM), Masked Image Modeling (MIM), and Contrastive Learning (CL). The mainstream datasets on which VLM pretraining relies, such as COCO and LAION-5B, are also surveyed systematically. Finally, representative VLMs are compared to clarify how model performance relates to data scale, architectural innovations, and training strategies, and the advantages and limitations of the related core technologies are discussed. The review thereby provides researchers in related fields with a comprehensive VLM technology map and offers reference and inspiration for future research.

Key words: Vision-Language Model (VLM), model architecture evolution, cross-modal fusion, multimodal pretraining

摘要:

随着深度学习技术的发展,人工智能正从单模态智能向多模态智能演进。视觉语言模型(VLM)作为连接视觉与语言的关键技术,已成为核心研究领域。针对VLM的技术演进历程,系统地综述它的架构发展,并总结该领域的核心技术和最新研究进展。首先,回顾VLM从早期探索到当前蓬勃发展的演进历程,分析关键技术节点和发展趋势,进而勾勒出以“架构发展”为核心主线的VLM技术发展图谱;其次,深入剖析当前VLM的基础技术,包括围绕视觉编码器、语言编码器和跨模态融合机制构建的核心架构,以及掩码语言建模(MLM)、掩码图像建模(MIM)和对比学习(CL)等关键预训练优化目标;同时,系统梳理当前VLM预训练所依赖的主流数据集如COCO和LAION-5B等;最后,对比分析代表性VLM,以阐明模型性能与数据规模、架构创新及训练策略间的关联,并评述相关核心技术的优势与局限性,从而为相关领域研究者提供全面的VLM技术图谱,助力把握发展脉络,并为未来研究提供参考与启发。

关键词: 视觉语言模型, 模型架构演进, 跨模态融合, 多模态预训练

CLC Number: