Journal of Computer Applications

Hierarchical cross-modal fusion method for 3D object detection based on Mamba model

LI Mingguang1, TAO Chongben1,2   

  1. School of Electronic and Information Engineering, Suzhou University of Science and Technology; 2. Suzhou Automotive Research Institute, Tsinghua University
  • Received: 2025-03-18 Revised: 2025-06-05 Online: 2025-06-23 Published: 2025-06-23
  • About author: LI Mingguang, born in 2001 in Linyi, Shandong, M.S. candidate. His research interests include 3D object detection and multimodal fusion. TAO Chongben, born in 1985 in Suzhou, Jiangsu, Ph.D., associate professor. His research interests include autonomous driving and multimodal fusion perception.
  • Corresponding author: TAO Chongben
  • Supported by:
    National Natural Science Foundation of China (62472300)

Abstract: Existing cross-modal fusion methods based on Bird's-Eye View (BEV) space neglect the effective preservation of local BEV feature information in the initial fusion stage, which leads to insufficient shallow cross-modal interaction, constrains subsequent deep fusion, and reduces 3D object detection accuracy. To address this issue, a hierarchical cross-modal fusion method based on the Mamba model was proposed. Mamba's state-space modeling was deeply coupled with a hierarchical fusion scheme: cross-modal features were mapped into a hidden state space to interact, thereby enriching local information, reducing discrepancies among modal features, and enhancing the consistency of the fused feature representations. In the shallow fusion stage, a feature-channel exchange mechanism was designed to exchange feature channels between different sensor modalities, improving the preservation of shallow-level details; the visual state space block of the Mamba model was then improved to strengthen effective interaction among shallow features and raise feature representation quality. In the deep fusion stage, an attention mechanism was introduced to compute inter-modal difference features, which were combined with a gating mechanism to identify and fuse complementary long-range dependencies between modalities. Finally, a channel-adaptive module computed channel attention on the normalized original features, adaptively learning intra-modal channel relationships to enhance single-modal BEV feature representations and thus compensate for the Mamba model's inherent weakness in modeling inter-channel relationships. Experimental results show that the proposed method outperforms methods such as TransFusion and LoGoNet on the nuScenes and Waymo datasets. On the nuScenes test set, it achieves a mean Average Precision (mAP) of 72.4% and a nuScenes Detection Score (NDS) of 73.9%, surpassing the baseline BEVFusion_mit by 2.2 and 1.0 percentage points, respectively.
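The abstract names three concrete fusion components: a shallow-stage feature-channel exchange, a deep-stage attention-plus-gating fusion of inter-modal difference features, and a channel-adaptive (channel attention) module. Below is a minimal PyTorch sketch of how such a pipeline could be wired; it is an illustrative assumption, not the authors' implementation. The fixed exchange ratio, the 1×1-convolution attention and gate, and the SE-style channel attention are all stand-ins for designs the abstract does not specify, and the improved Mamba visual state space block is omitted entirely.

```python
import torch
import torch.nn as nn

class ChannelExchange(nn.Module):
    """Shallow fusion: swap a fraction of channels between the camera and
    LiDAR BEV maps so each branch keeps some of the other's local detail.
    The fixed exchange ratio is a hypothetical hyperparameter."""
    def __init__(self, ratio: float = 0.5):
        super().__init__()
        self.ratio = ratio

    def forward(self, cam: torch.Tensor, lidar: torch.Tensor):
        # cam, lidar: (B, C, H, W) BEV features from the two sensor branches
        c = cam.size(1)
        k = max(int(c * self.ratio), 1)
        mask = torch.zeros(c, dtype=torch.bool, device=cam.device)
        mask[:: max(c // k, 1)] = True          # exchange every (C/k)-th channel
        cam_out, lidar_out = cam.clone(), lidar.clone()
        cam_out[:, mask], lidar_out[:, mask] = lidar[:, mask], cam[:, mask]
        return cam_out, lidar_out

class ChannelAdaptive(nn.Module):
    """Channel-adaptive module: SE-style channel attention computed on the
    normalized original features, used here as a stand-in for the paper's
    design that compensates for Mamba's weak inter-channel modeling."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(self.norm(x).mean(dim=(2, 3)))  # (B, C) channel weights
        return x * w[:, :, None, None]

class GatedDifferenceFusion(nn.Module):
    """Deep fusion: attend to inter-modal difference features and gate the
    two modalities per pixel (attention + gating as the abstract describes,
    with 1x1 convolutions as an assumed parameterization)."""
    def __init__(self, channels: int):
        super().__init__()
        self.att = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, cam: torch.Tensor, lidar: torch.Tensor) -> torch.Tensor:
        diff = cam - lidar
        diff = self.att(diff) * diff                   # weighted difference features
        g = self.gate(torch.cat([cam, lidar], dim=1))  # per-pixel modality gate
        return g * cam + (1.0 - g) * lidar + diff      # fused BEV features

if __name__ == "__main__":
    cam = torch.randn(2, 64, 32, 32)    # toy camera BEV features
    lidar = torch.randn(2, 64, 32, 32)  # toy LiDAR BEV features
    cam, lidar = ChannelExchange(ratio=0.5)(cam, lidar)
    cam, lidar = ChannelAdaptive(64)(cam), ChannelAdaptive(64)(lidar)
    fused = GatedDifferenceFusion(64)(cam, lidar)
    print(fused.shape)                  # torch.Size([2, 64, 32, 32])
```

In this sketch the channel-adaptive module is applied per modality before the deep fusion step, matching the abstract's statement that it enhances single-modal BEV feature representations rather than the fused features.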

Key words: 3D object detection, cross-modal fusion, Mamba, Bird's-Eye View (BEV), autonomous driving
