Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (2): 572-579.DOI: 10.11772/j.issn.1001-9081.2025030281

• Multimedia computing and computer simulation •

Hierarchical cross-modal fusion method for 3D object detection based on Mamba model

Mingguang LI1, Chongben TAO1,2()   

  1. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
    2. Suzhou Automotive Research Institute, Tsinghua University, Suzhou, Jiangsu 215200, China
  • Received:2025-03-21 Revised:2025-06-04 Accepted:2025-06-09 Online:2025-06-23 Published:2026-02-10
  • Contact: Chongben TAO
  • About author: LI Mingguang, born in 2001, M. S. candidate. His research interests include 3D object detection and multi-modal fusion.
    TAO Chongben, born in 1985, Ph. D., associate professor. His research interests include autonomous driving and multi-modal fusion perception. E-mail: tom1tao@163.com
  • Supported by:
    National Natural Science Foundation of China(62472300)


Abstract:

To address the problem that existing cross-modal fusion methods based on Bird's-Eye View (BEV) neglect effective preservation of local BEV feature information in the initial fusion stage, which leads to insufficient shallow cross-modal interaction, constrains subsequent deep fusion, and reduces 3D object detection accuracy, a hierarchical cross-modal fusion method for 3D object detection based on the Mamba model was proposed. In this method, the state space modeling mechanism of Mamba was deeply integrated with a hierarchical fusion mechanism: cross-modal features were mapped into a hidden state space for interaction, thereby enriching local information, reducing discrepancies among modal features, and enhancing the consistency of the fused feature representations. In the shallow fusion stage, a feature channel exchange mechanism was designed to exchange feature channels between different sensor modalities, improving the preservation of shallow local details, and the Visual State Space (VSS) block of the Mamba model was improved to strengthen interaction among shallow features. In the deep fusion stage, an attention mechanism and a gating mechanism were introduced to construct a hidden feature transformation that identifies and fuses complementary long-range dependency features across modalities. Finally, a channel adaptive module was employed to compute channel attention on the normalized original features and adaptively learn intra-modal channel relationships, enhancing the single-modal BEV feature representations and compensating for the Mamba model's limitation in modeling inter-channel relationships. Experimental results show that the proposed method outperforms methods such as TransFusion and LoGoNet (Local-to-Global Network), a multi-modal fusion method combining local and global modeling, on the nuScenes and Waymo datasets.
On the nuScenes test set, the proposed method achieves a mean Average Precision (mAP) of 72.4% and a nuScenes Detection Score (NDS) of 73.9%, surpassing the baseline BEVFusion_mit by 2.2 and 1.0 percentage points, respectively.
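The shallow-stage feature channel exchange can be illustrated with a minimal NumPy sketch. The function name, the use of plain arrays in place of learned BEV tensors, and the fixed random exchange ratio are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def channel_exchange(cam_bev, lidar_bev, ratio=0.5, seed=0):
    """Swap a random subset of channels between two (C, H, W) BEV maps.

    Exchanging channels lets each modality carry raw local detail from
    the other into later fusion stages. `ratio` is the fraction of
    channels exchanged (an assumed hyperparameter for this sketch).
    """
    c = cam_bev.shape[0]
    rng = np.random.default_rng(seed)
    # Pick which channels to swap between the two modalities.
    idx = rng.choice(c, size=int(c * ratio), replace=False)
    cam_out, lidar_out = cam_bev.copy(), lidar_bev.copy()
    cam_out[idx], lidar_out[idx] = lidar_bev[idx], cam_bev[idx]
    return cam_out, lidar_out

cam = np.zeros((4, 2, 2))    # stand-in camera BEV features (all zeros)
lidar = np.ones((4, 2, 2))   # stand-in LiDAR BEV features (all ones)
cam_x, lidar_x = channel_exchange(cam, lidar, ratio=0.5)
```

With a 0.5 ratio, two of the four channels in each output come from the other modality, so each fused map preserves local content from both sensors.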

Key words: 3D object detection, cross-modal fusion, Mamba, Bird’s-Eye View (BEV), autonomous driving
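The deep-stage combination of attention and gating can be sketched as a content-dependent blend of two aligned BEV maps. The toy gate below (a sigmoid of the summed features, with no learned weights) only stands in for the paper's hidden feature transformation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(feat_a, feat_b):
    """Blend two aligned (C, H, W) BEV maps with a per-position gate.

    The gate g lies in (0, 1); each location mixes the two modalities,
    so complementary long-range features from either side can dominate
    where they are stronger. A real module would learn the gate.
    """
    g = sigmoid(feat_a + feat_b)
    return g * feat_a + (1.0 - g) * feat_b

a = np.full((2, 3, 3), 2.0)    # stand-in for one modality's features
b = np.full((2, 3, 3), -2.0)   # stand-in for the other modality
fused = gated_fusion(a, b)
```

When the two inputs cancel, the gate sits at 0.5 and the fusion averages them; asymmetric inputs push the gate toward the stronger modality.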

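The channel adaptive module's per-channel reweighting can be illustrated with a squeeze-and-excitation-style sketch. The identity projections (no learned layers) and the plain sigmoid gate are simplifying assumptions for illustration:

```python
import numpy as np

def channel_attention(feat):
    """SE-style channel reweighting on a (C, H, W) BEV feature map.

    Each channel is summarized by global average pooling, squashed to
    (0, 1) with a sigmoid, and used to rescale that channel. This is
    the kind of inter-channel modeling the abstract says compensates
    for Mamba's sequence-oriented state space blocks.
    """
    pooled = feat.mean(axis=(1, 2))            # (C,) channel descriptors
    weights = 1.0 / (1.0 + np.exp(-pooled))    # sigmoid gate per channel
    return feat * weights[:, None, None]

x = np.stack([np.zeros((2, 2)), np.full((2, 2), 4.0)])  # C = 2
y = channel_attention(x)
```

The all-zero channel is halved (gate 0.5) while the strong channel keeps most of its magnitude, so informative channels are emphasized adaptively.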
