To address the problems that existing 3D visual grounding methods rely on expensive sensor equipment, incur high system costs, and exhibit poor accuracy and robustness in complex multi-target grounding scenarios, a multi-target 3D visual grounding method based on monocular images was proposed. In this method, natural language descriptions were combined with a single RGB image to localize multiple 3D targets. To this end, a multi-target visual grounding dataset, Mmo3DRefer, was constructed, and a cross-modal matching network, TextVizNet, was designed. In TextVizNet, 3D bounding boxes of candidate targets were generated by a pre-trained monocular detector, and visual and linguistic information was deeply integrated via an information fusion module and an information alignment module, thereby realizing text-guided multi-target 3D detection. Comparative experiments against five existing state-of-the-art methods, including CORE-3DVG (Contextual Objects and RElations for 3D Visual Grounding), 3DVG-Transformer, and Multi3DRefer (Multiple 3D object Referencing dataset and task), show that TextVizNet improves the F1-score, precision, and recall on the Mmo3DRefer dataset by 8.92%, 8.39%, and 9.57%, respectively, over the second-best method, Multi3DRefer. These results demonstrate that TextVizNet significantly improves the accuracy of text-based multi-target grounding in complex scenarios and provides effective support for practical applications such as autonomous driving and intelligent robotics.
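As a rough illustration of the text-guided matching pipeline described above, the following PyTorch sketch shows how features of detector-proposed 3D boxes might be fused with language features via cross-attention and then scored per proposal, so that multiple matching targets can be selected. The module names (FusionBlock, AlignmentHead), feature dimensions, and attention design are illustrative assumptions; the abstract does not specify the internal architecture of TextVizNet's fusion and alignment modules.

```python
# Illustrative sketch only (not the authors' implementation): a text-guided
# multi-target matching head. Detector box features and word features are
# fused by cross-attention, then each proposal receives a matching score.
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    """Cross-attention fusion: object proposals attend to word features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, obj_feats, word_feats):
        # obj_feats: (B, N_obj, dim) features of 3D boxes from a monocular detector
        # word_feats: (B, N_word, dim) encoded natural language description
        fused, _ = self.attn(obj_feats, word_feats, word_feats)
        x = self.norm(obj_feats + fused)
        return self.norm(x + self.ffn(x))


class AlignmentHead(nn.Module):
    """Scores each fused proposal feature for agreement with the text."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, fused_obj_feats):
        # One matching logit per proposal; multiple targets can be kept by
        # thresholding the sigmoid of these logits (multi-target grounding).
        return self.score(fused_obj_feats).squeeze(-1)


if __name__ == "__main__":
    B, N_obj, N_word, dim = 2, 50, 20, 256
    obj_feats = torch.randn(B, N_obj, dim)    # from a pre-trained monocular 3D detector
    word_feats = torch.randn(B, N_word, dim)  # from a text encoder
    fusion, align = FusionBlock(dim), AlignmentHead(dim)
    logits = align(fusion(obj_feats, word_feats))
    print(logits.shape)  # torch.Size([2, 50]): per-proposal matching scores
```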