Journal of Computer Applications

Hierarchical cross-modal fusion method for 3D object detection based on Mamba model

LI Mingguang1, TAO Chongben1,2   

  1. School of Electronic and Information Engineering, Suzhou University of Science and Technology; 2. Suzhou Automotive Research Institute, Tsinghua University
  • Received: 2025-03-18 Revised: 2025-06-05 Online: 2025-06-23 Published: 2025-06-23
  • About author: LI Mingguang, born in 2001 in Linyi, Shandong, M.S. candidate. His research interests include 3D object detection and multimodal fusion. TAO Chongben, born in 1985 in Suzhou, Jiangsu, Ph.D., associate professor. His research interests include autonomous driving and multimodal fusion perception.
  • Corresponding author: TAO Chongben
  • Supported by:
    National Natural Science Foundation of China (62472300)

Abstract: Existing cross-modal fusion methods based on Bird's-Eye View (BEV) space neglect the effective preservation of local BEV feature information in the initial fusion stage, which leads to insufficient shallow cross-modal interaction, constrains subsequent deep fusion, and reduces 3D object detection accuracy. To address this issue, a hierarchical cross-modal fusion method based on the Mamba model was proposed. Mamba's state-space modeling was deeply coupled with a hierarchical fusion scheme: cross-modal features were mapped into a hidden state space to interact, thereby enriching local information, reducing discrepancies among modal features, and enhancing the consistency of the fused feature representations. In the shallow fusion stage, a feature-channel exchange mechanism was designed to exchange feature channels between different sensor modalities, improving the preservation of shallow-level details; the visual state space block of the Mamba model was then improved to strengthen effective interaction among shallow features and raise feature representation quality. In the deep fusion stage, an attention mechanism was introduced to compute inter-modal difference features, which were combined with a gating mechanism to identify and fuse complementary long-range dependencies between modalities. Finally, a channel-adaptive module computed channel attention on the normalized original features, adaptively learning intra-modal channel relationships to enhance single-modal BEV feature representations and thus compensate for the Mamba model's inherent weakness in modeling inter-channel relationships. Experimental results show that the proposed method outperforms methods such as TransFusion and LoGoNet on the nuScenes and Waymo datasets. On the nuScenes test set, it achieves a mean Average Precision (mAP) of 72.4% and a nuScenes Detection Score (NDS) of 73.9%, surpassing the baseline BEVFusion_mit by 2.2 and 1.0 percentage points, respectively.
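The abstract names three concrete fusion components: a shallow-stage feature-channel exchange, a deep-stage attention-plus-gating fusion of inter-modal difference features, and a channel-adaptive (channel attention) module. Below is a minimal PyTorch sketch of how such a pipeline could be wired; it is an illustrative assumption, not the authors' implementation. The fixed exchange ratio, the 1×1-convolution attention and gate, and the SE-style channel attention are all stand-ins for designs the abstract does not specify, and the improved Mamba visual state space block is omitted entirely.

```python
import torch
import torch.nn as nn

class ChannelExchange(nn.Module):
    """Shallow fusion: swap a fraction of channels between the camera and
    LiDAR BEV maps so each branch keeps some of the other's local detail.
    The fixed exchange ratio is a hypothetical hyperparameter."""
    def __init__(self, ratio: float = 0.5):
        super().__init__()
        self.ratio = ratio

    def forward(self, cam: torch.Tensor, lidar: torch.Tensor):
        # cam, lidar: (B, C, H, W) BEV features from the two sensor branches
        c = cam.size(1)
        k = max(int(c * self.ratio), 1)
        mask = torch.zeros(c, dtype=torch.bool, device=cam.device)
        mask[:: max(c // k, 1)] = True          # exchange every (C/k)-th channel
        cam_out, lidar_out = cam.clone(), lidar.clone()
        cam_out[:, mask], lidar_out[:, mask] = lidar[:, mask], cam[:, mask]
        return cam_out, lidar_out

class ChannelAdaptive(nn.Module):
    """Channel-adaptive module: SE-style channel attention computed on the
    normalized original features, used here as a stand-in for the paper's
    design that compensates for Mamba's weak inter-channel modeling."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(self.norm(x).mean(dim=(2, 3)))  # (B, C) channel weights
        return x * w[:, :, None, None]

class GatedDifferenceFusion(nn.Module):
    """Deep fusion: attend to inter-modal difference features and gate the
    two modalities per pixel (attention + gating as the abstract describes,
    with 1x1 convolutions as an assumed parameterization)."""
    def __init__(self, channels: int):
        super().__init__()
        self.att = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, cam: torch.Tensor, lidar: torch.Tensor) -> torch.Tensor:
        diff = cam - lidar
        diff = self.att(diff) * diff                   # weighted difference features
        g = self.gate(torch.cat([cam, lidar], dim=1))  # per-pixel modality gate
        return g * cam + (1.0 - g) * lidar + diff      # fused BEV features

if __name__ == "__main__":
    cam = torch.randn(2, 64, 32, 32)    # toy camera BEV features
    lidar = torch.randn(2, 64, 32, 32)  # toy LiDAR BEV features
    cam, lidar = ChannelExchange(ratio=0.5)(cam, lidar)
    cam, lidar = ChannelAdaptive(64)(cam), ChannelAdaptive(64)(lidar)
    fused = GatedDifferenceFusion(64)(cam, lidar)
    print(fused.shape)                  # torch.Size([2, 64, 32, 32])
```

In this sketch the channel-adaptive module is applied per modality before the deep fusion step, matching the abstract's statement that it enhances single-modal BEV feature representations rather than the fused features.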

Key words: 3D object detection, cross-modal fusion, Mamba, Bird's-Eye View (BEV), autonomous driving
