Hierarchical cross-modal fusion method for 3D object detection based on Mamba model
Mingguang LI, Chongben TAO
Journal of Computer Applications    2026, 46 (2): 572-579.   DOI: 10.11772/j.issn.1001-9081.2025030281
Abstract

To address the problem that existing cross-modal fusion methods based on Bird's-Eye View (BEV) neglect the effective preservation of local BEV feature information in the initial fusion stage, which leads to insufficient shallow cross-modal interaction, constrains subsequent deep fusion, and reduces 3D object detection accuracy, a hierarchical cross-modal fusion method for 3D object detection based on the Mamba model was proposed. In this method, the state space modeling mechanism of Mamba was deeply integrated with a hierarchical fusion mechanism, so that cross-modal features were mapped into a hidden state space for interaction, thereby enriching local information, reducing discrepancies among modal features, and enhancing the consistency of the fused feature representation. In the shallow fusion stage, a feature channel exchange mechanism was designed to exchange feature channels between sensor modalities, improving the preservation of shallow local details, and the Visual State Space (VSS) block of the Mamba model was improved to strengthen interaction among shallow features. In the deep fusion stage, an attention mechanism and a gating mechanism were introduced to construct a hidden feature transformation that identifies and fuses complementary long-range dependency features across modalities. Finally, a channel adaptive module was employed to compute channel attention on the normalized original features and to adaptively learn intra-modal channel relationships, enhancing the single-modal BEV feature representations and compensating for the Mamba model's limited modeling of inter-channel relationships. Experimental results show that the proposed method achieves superior detection performance on the nuScenes and Waymo datasets compared with methods such as TransFusion and LoGoNet (Local-to-Global Network), a multi-modal fusion method combining local and global modeling. On the nuScenes test set, the proposed method achieves a mean Average Precision (mAP) of 72.4% and a nuScenes Detection Score (NDS) of 73.9%, surpassing the baseline BEVFusion_mit by 2.2 and 1.0 percentage points, respectively.
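The three mechanisms named in the abstract (shallow channel exchange between modal BEV features, gated deep fusion, and channel-adaptive attention) can be illustrated with a minimal numpy sketch. This is a hypothetical toy implementation for intuition only: the function names, the exchange ratio, the gate formulation, and the SE-style two-layer excitation MLP are all assumptions, not the paper's actual architecture, which also involves Mamba state space blocks not reproduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_exchange(cam_bev, lidar_bev, ratio=0.5):
    """Swap a fraction of feature channels between the camera and LiDAR
    BEV features (hypothetical sketch of the shallow-fusion exchange)."""
    k = int(cam_bev.shape[0] * ratio)
    cam_out, lidar_out = cam_bev.copy(), lidar_bev.copy()
    cam_out[:k], lidar_out[:k] = lidar_bev[:k], cam_bev[:k]
    return cam_out, lidar_out

def gated_fusion(a, b):
    """Fuse two feature maps with an element-wise gate: g*a + (1-g)*b.
    The gate here is purely illustrative."""
    g = sigmoid(a - b)
    return g * a + (1.0 - g) * b

def channel_attention(feat, w1, w2):
    """SE-style channel attention on a (C, H, W) BEV feature:
    squeeze by global average pooling, excite with a 2-layer MLP,
    then rescale each channel."""
    s = feat.mean(axis=(1, 2))                  # squeeze -> (C,)
    z = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))   # excitation -> (C,)
    return feat * z[:, None, None]              # channel-wise rescale

# Toy data: 8-channel BEV features on a 4x4 grid for each modality.
rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
cam = rng.standard_normal((C, H, W))
lidar = rng.standard_normal((C, H, W))

cam_x, lidar_x = channel_exchange(cam, lidar)   # shallow stage
fused = gated_fusion(cam_x, lidar_x)            # deep stage
w1 = rng.standard_normal((C // 2, C))           # reduction weights
w2 = rng.standard_normal((C, C // 2))           # expansion weights
out = channel_attention(fused, w1, w2)          # channel adaptive module
print(out.shape)  # (8, 4, 4)
```

The exchange step copies both arrays before swapping, so each modality retains its own deep channels while receiving the other's shallow ones; the output keeps the input's (C, H, W) shape throughout, so the sketch composes stage by stage.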
