Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (1): 207-215. DOI: 10.11772/j.issn.1001-9081.2025010074

• Multimedia computing and computer simulation •

Multi-target 3D visual grounding method based on monocular images

Shuwen HUANG1, Keyu GUO1, Xiangyu SONG2, Feng HAN1, Shijie SUN2, Huansheng SONG1

  1. School of Information Engineering, Chang'an University, Xi'an, Shaanxi 710064, China
    2. School of Data Science and Artificial Intelligence, Chang'an University, Xi'an, Shaanxi 710064, China
  • Received: 2025-01-20 Revised: 2025-03-05 Accepted: 2025-03-12 Online: 2026-01-10 Published: 2026-01-10
  • Contact: Xiangyu SONG
  • About author:HUANG Shuwen, born in 2001, M. S. candidate. Her research interests include computer vision, 3D visual grounding.
    GUO Keyu, born in 1999, Ph. D. candidate. His research interests include computer vision, object tracking, 3D visual grounding.
    HAN Feng, born in 2001, M. S. candidate. His research interests include computer vision, anomaly detection.
    SUN Shijie, born in 1989, Ph. D., associate professor. His research interests include computer vision, object tracking, pose estimation.
    SONG Huansheng, born in 1964, Ph. D., professor. His research interests include computer vision, image processing, intelligent transportation.
  • Supported by:
    National Key Research and Development Program of China(2023YFB4301800)

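Multi-target 3D visual grounding method based on monocular images

HUANG Shuwen1, GUO Keyu1, SONG Xiangyu2, HAN Feng1, SUN Shijie2, SONG Huansheng1

  1. School of Information Engineering, Chang'an University, Xi'an 710064
    2. School of Data Science and Artificial Intelligence, Chang'an University, Xi'an 710064
  • Corresponding author: SONG Xiangyu
  • About the authors: HUANG Shuwen (born in 2001), female, from Guiping, Guangxi, M. S. candidate, CCF member. Her research interests include computer vision, 3D visual grounding.
    GUO Keyu (born in 1999), male, from Qiannan, Guizhou, Ph. D. candidate. His research interests include computer vision, object tracking, 3D visual grounding.
    HAN Feng (born in 2001), male, from Lüliang, Shanxi, M. S. candidate, CCF member. His research interests include computer vision, anomaly detection.
    SUN Shijie (born in 1989), male, from Shangqiu, Henan, associate professor, Ph. D. His research interests include computer vision, object tracking, pose estimation.
    SONG Huansheng (born in 1964), male, from Liangcheng, Inner Mongolia, professor, Ph. D. His research interests include computer vision, image processing, intelligent transportation.
  • Supported by:
    National Key Research and Development Program of China (2023YFB4301800)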

Abstract:

Existing 3D visual grounding methods rely on expensive sensor equipment, incur high system costs, and show poor accuracy and robustness in complex multi-target grounding scenarios. To address these problems, a multi-target 3D visual grounding method based on monocular images was proposed. The method uses natural language descriptions to identify multiple 3D targets in a single RGB image. To this end, a multi-target visual grounding dataset, Mmo3DRefer, was constructed, and a cross-modal matching network, TextVizNet, was designed. In TextVizNet, a pre-trained monocular detector generates 3D bounding boxes for targets, and an information fusion module and an information alignment module deeply integrate visual and linguistic information, thereby realizing text-guided multi-target 3D detection. In comparison experiments with 5 existing advanced methods, including CORE-3DVG (Contextual Objects and RElations for 3D Visual Grounding), 3DVG-Transformer, and Multi3DRefer (Multiple 3D object Referencing dataset and task), TextVizNet improves the F1-score, precision, and recall on the Mmo3DRefer dataset by 8.92%, 8.39%, and 9.57%, respectively, over the second-best method, Multi3DRefer. These results show that the method significantly improves the accuracy of text-based multi-target grounding in complex scenarios and provides effective support for practical applications such as autonomous driving and intelligent robotics.

Key words: 3D visual grounding, monocular image, multi-modal technology, object detection, scene understanding
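
The F1-score, precision, and recall reported in the abstract can be illustrated for a single multi-target query with the minimal Python sketch below. It is not the authors' evaluation code: the function names `select_targets` and `precision_recall_f1`, the 0.5 score threshold, and the assumption that predictions and ground truth are compared as sets of candidate-box indices are all hypothetical choices made for illustration; the abstract does not state the matching criterion used on Mmo3DRefer.

```python
from typing import Iterable, Set, Tuple


def select_targets(scores: Iterable[float], threshold: float = 0.5) -> Set[int]:
    """Keep the indices of candidate 3D boxes whose text-match score
    reaches the threshold (hypothetical selection rule)."""
    return {i for i, s in enumerate(scores) if s >= threshold}


def precision_recall_f1(predicted: Set[int], referred: Set[int]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one query.

    predicted: indices of boxes the model selected
    referred:  indices of boxes the description actually refers to
    """
    true_positives = len(predicted & referred)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(referred) if referred else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up match scores for five candidate boxes from a monocular detector.
scores = [0.91, 0.12, 0.77, 0.64, 0.05]
predicted = select_targets(scores)   # boxes 0, 2, and 3 pass the threshold
referred = {0, 2, 4}                 # boxes the text refers to (illustrative)

p, r, f1 = precision_recall_f1(predicted, referred)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy query two of the three selected boxes are correct and two of the three referred boxes are found, so all three metrics equal 2/3. A dataset-level score would aggregate such per-query values, and how that aggregation is done is again an assumption not covered by the abstract.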

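Abstract:

To address the problems that existing 3D visual grounding methods depend on expensive sensor equipment, have high system costs, and lack accuracy and robustness in complex multi-target grounding, a multi-target 3D visual grounding method based on monocular images is proposed. The method combines natural language descriptions to identify multiple 3D targets in a single RGB image. To this end, a multi-target visual grounding dataset, Mmo3DRefer, is constructed, and a cross-modal matching network, TextVizNet, is designed. TextVizNet generates 3D bounding boxes for targets with a pre-trained monocular detector and deeply integrates visual and linguistic information through an information fusion module and an information alignment module, thereby achieving text-guided multi-target 3D detection. Experimental results of a comparison with 5 methods, including CORE-3DVG (Contextual Objects and RElations for 3D Visual Grounding), 3DVG-Transformer, and Multi3DRefer (Multiple 3D object Referencing dataset and task), show that, compared with the second-best method Multi3DRefer, TextVizNet improves the F1-score, precision, and recall on the Mmo3DRefer dataset by 8.92%, 8.39%, and 9.57%, respectively. The method significantly improves the accuracy of text-based multi-target grounding in complex scenarios and provides effective support for practical applications such as autonomous driving and intelligent robotics.

Key words: 3D visual grounding, monocular image, multi-modal technology, object detection, scene understanding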

CLC Number: