Multi-target 3D visual grounding method based on monocular images
Shuwen HUANG, Keyu GUO, Xiangyu SONG, Feng HAN, Shijie SUN, Huansheng SONG
Journal of Computer Applications, 2026, 46(1): 207-215. DOI: 10.11772/j.issn.1001-9081.2025010074
Abstract

Existing 3D visual grounding methods rely on expensive sensor equipment, incur high system costs, and exhibit poor accuracy and robustness in complex multi-target grounding scenarios. To address these problems, a multi-target 3D visual grounding method based on monocular images was proposed, in which natural language descriptions were combined with a single RGB image to recognize multiple 3D targets. To this end, a multi-target visual grounding dataset, Mmo3DRefer, was constructed, and a cross-modal matching network, TextVizNet, was designed. In TextVizNet, 3D bounding boxes for candidate targets were generated by a pre-trained monocular detector, and visual and linguistic information was fused deeply through an information fusion module and an information alignment module, thereby realizing text-guided multi-target 3D detection. Experimental results of comparison with five existing advanced methods, including CORE-3DVG (Contextual Objects and RElations for 3D Visual Grounding), 3DVG-Transformer, and Multi3DRefer (Multiple 3D object Referencing dataset and task), show that TextVizNet improves the F1-score, precision, and recall by 8.92%, 8.39%, and 9.57%, respectively, over the second-best method Multi3DRefer on the Mmo3DRefer dataset, improving the precision of text-based multi-target grounding in complex scenes significantly and providing effective support for practical applications such as autonomous driving and intelligent robotics.
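The abstract does not disclose TextVizNet's internals, so the following is only a minimal sketch of the general pattern it describes: a pre-trained monocular detector supplies per-box features, a fusion step lets box features attend to the encoded language description, and an alignment/scoring step selects every box that matches, enabling multi-target grounding. All class names, dimensions, and the cross-attention design here are assumptions for illustration, not the paper's actual architecture.

```python
# Hypothetical sketch of text-guided multi-target box matching.
# CrossModalMatcher, its module layout, and all dimensions are
# illustrative assumptions, not taken from the TextVizNet paper.
import torch
import torch.nn as nn

class CrossModalMatcher(nn.Module):
    """Scores each candidate 3D box against a language description.

    A pre-trained monocular detector is assumed to supply per-box
    features; random tensors stand in for them below.
    """
    def __init__(self, box_feat_dim=256, text_dim=256, hidden=256):
        super().__init__()
        self.box_proj = nn.Linear(box_feat_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        # Fusion step: each box feature attends to the description tokens.
        self.fusion = nn.MultiheadAttention(hidden, num_heads=4,
                                            batch_first=True)
        # Alignment step: score how well each fused box matches the text.
        self.score_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, box_feats, text_feats):
        # box_feats:  (B, N_boxes, box_feat_dim) from the detector
        # text_feats: (B, N_tokens, text_dim) from a language encoder
        q = self.box_proj(box_feats)
        kv = self.text_proj(text_feats)
        fused, _ = self.fusion(q, kv, kv)           # boxes attend to text
        return self.score_head(fused).squeeze(-1)   # (B, N_boxes) logits

# Multi-target selection: every box above a threshold is grounded,
# so one description can refer to several objects at once.
model = CrossModalMatcher()
boxes = torch.randn(1, 20, 256)   # features of 20 candidate 3D boxes
tokens = torch.randn(1, 12, 256)  # 12 encoded description tokens
logits = model(boxes, tokens)
selected = (logits.sigmoid() > 0.5).nonzero()
print(selected)                   # indices of grounded boxes
```

Scoring each box independently, rather than picking a single argmax, is what distinguishes the multi-target setting: a description such as "the two cars on the left" can legitimately select more than one box.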
