

Survey on action representation and generation strategies in Vision-Language-Action models for embodied intelligence

张文涛 1,2, 孙奥兰 1, 瞿晓阳 1, 张旭龙 1, 王健宗 1

  1. Ping An Technology (Shenzhen) Co., Ltd.
  2. Tsinghua Shenzhen International Graduate School
  • Received: 2025-08-07  Revised: 2025-09-19  Online: 2025-11-05  Published: 2025-11-05
  • Corresponding author: 张文涛


Abstract: Vision-Language-Action (VLA) models constitute a critical pathway toward embodied intelligence; their core function is the seamless transformation of multimodal perception and understanding into concrete actions in the physical world. Action representation and generation strategies, serving as the pivotal bridge between perception and execution, face significant challenges stemming from high-dimensional continuous action spaces, the diversity of feasible actions, and the stringent demands of real-time robotic control. This survey provides a systematic review of the evolution, key methodologies, and future directions of action representation and generation in VLA models. We analyze discrete and continuous representations in depth, examine autoregressive, non-autoregressive, and hybrid generation strategies, and highlight their inherent trade-offs among action precision, generation diversity, and inference efficiency. In addition, we cover emerging efficiency-oriented approaches designed for real-time control, such as hybrid generation architectures. Finally, we present a comparative synthesis of the current technological landscape and outline frontier challenges and research opportunities, including integration with world models and generalizable action representations across heterogeneous robot embodiments, aiming to provide a comprehensive reference for building more general and more efficient embodied agents.
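To make the contrast between discrete and continuous action representations concrete, the minimal sketch below shows a uniform-binning action tokenizer of the kind used by discrete, autoregressive VLA policies: continuous actions are quantized into integer tokens that a language-model head can emit one at a time, and are decoded back to (slightly lossy) continuous values at execution time. The class name UniformActionTokenizer, the 256-bin setting, the [-1, 1] normalization range, and the 7-DoF example action are illustrative assumptions, not details taken from any specific model covered by the survey.

```python
import numpy as np

class UniformActionTokenizer:
    """Illustrative discrete action representation: uniform binning over [low, high]."""

    def __init__(self, low: float = -1.0, high: float = 1.0, num_bins: int = 256):
        self.low, self.high, self.num_bins = low, high, num_bins
        # Midpoint of each bin, used when decoding tokens back to continuous values.
        self.centers = low + (np.arange(num_bins) + 0.5) * (high - low) / num_bins

    def encode(self, action: np.ndarray) -> np.ndarray:
        """Quantize each action dimension into one of `num_bins` integer tokens."""
        clipped = np.clip(action, self.low, self.high)
        scaled = (clipped - self.low) / (self.high - self.low)  # map to [0, 1]
        return np.minimum((scaled * self.num_bins).astype(int), self.num_bins - 1)

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        """Recover a continuous action from tokens; quantization error remains."""
        return self.centers[tokens]


if __name__ == "__main__":
    tokenizer = UniformActionTokenizer(num_bins=256)
    # A hypothetical 7-DoF end-effector action: 3 translation, 3 rotation, 1 gripper.
    action = np.array([0.12, -0.30, 0.05, 0.00, 0.25, -0.10, 1.00])

    tokens = tokenizer.encode(action)      # discrete form: 7 integer tokens, suitable for
                                           # autoregressive, token-by-token decoding
    recovered = tokenizer.decode(tokens)   # continuous form recovered for the controller

    print("tokens   :", tokens)
    print("recovered:", recovered)
    print("max quantization error:", np.abs(action - recovered).max())
```

The residual quantization error printed at the end is exactly the precision cost that motivates the continuous and hybrid generation strategies discussed in the survey, while the token form is what enables reuse of standard autoregressive language-model decoding.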