Advances in deep learning have driven artificial intelligence to evolve from single-modality intelligence toward multimodal intelligence. Vision-Language Models (VLMs), a pivotal means of bridging vision and language, have become a core research area. Focusing on the technological evolution of VLMs, the architectural development of VLMs was reviewed systematically, and the core technologies and latest research progress in this field were summarized. Firstly, the progression of VLMs from early explorations to the current flourishing state was traced, key technological milestones and development trends were analyzed, and a technology roadmap with “architecture development” as its core theme was delineated. Secondly, the foundational techniques of current VLMs were analyzed in depth, including core architectures built around vision encoders, language encoders, and cross-modal fusion mechanisms, as well as key pretraining objectives such as Masked Language Modeling (MLM), Masked Image Modeling (MIM), and Contrastive Learning (CL). The mainstream datasets on which VLM pretraining relies, such as COCO and LAION-5B, were also listed systematically. Finally, representative VLMs were compared and analyzed to reveal the relationships among model performance, data scale, architectural innovations, and training strategies, and the advantages and limitations of the related core technologies were discussed, thereby providing a comprehensive VLM technology map for researchers in related fields and offering reference and inspiration for future research.