Hierarchical cross-modal fusion method for 3D object detection based on Mamba model

Journal of Computer Applications, 2026, Vol. 46, Issue (2): 572-579. DOI: 10.11772/j.issn.1001-9081.2025030281

• Multimedia computing and computer simulation •

Mingguang LI¹, Chongben TAO¹,²

1. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
2. Suzhou Automotive Research Institute, Tsinghua University, Suzhou, Jiangsu 215200, China

Received: 2025-03-21 Revised: 2025-06-04 Accepted: 2025-06-09 Online: 2025-06-23 Published: 2026-02-10
Contact: Chongben TAO
About the authors: LI Mingguang, born in 2001, male, native of Linyi, Shandong, M.S. candidate. His research interests include 3D object detection and multi-modal fusion.
TAO Chongben, born in 1985, male, native of Suzhou, Jiangsu, Ph.D., associate professor. His research interests include autonomous driving and multi-modal fusion perception. Email: tom1tao@163.com
Supported by:
National Natural Science Foundation of China (62472300)

Abstract

Existing cross-modal fusion methods based on Bird's-Eye View (BEV) neglect to preserve local BEV feature information in the initial fusion stage. This leads to insufficient shallow cross-modal interaction, which constrains the subsequent deep fusion and reduces 3D object detection accuracy. To address this problem, a hierarchical cross-modal fusion method for 3D object detection based on the Mamba model was proposed. The method integrates the state space modeling mechanism of Mamba with a hierarchical fusion mechanism, so that cross-modal features are mapped into a hidden state space for interaction, thereby enriching local information, reducing the discrepancies among cross-modal features, and enhancing the consistency of the fused feature representation. Firstly, in the shallow fusion stage, a feature channel exchange mechanism was designed to exchange feature channels between sensor modalities, improving the preservation of shallow local details, and the Visual State Space (VSS) block of the Mamba model was improved to strengthen interaction among shallow features. Then, in the deep fusion stage, an attention mechanism and a gating mechanism were introduced to construct a hidden feature transformation, so as to identify and fuse complementary long-range dependency features across modalities. Finally, a channel adaptive module was used to compute channel attention on the normalized original features and to learn intra-modal channel relationships adaptively, which enhances the BEV feature representation of each single modality and compensates for the Mamba model's limitation in modeling inter-channel relationships. Experimental results show that the proposed method outperforms methods such as TransFusion and LoGoNet (Local-to-Global Network), a multi-modal fusion method that combines local and global modeling, on the nuScenes and Waymo datasets. On the nuScenes test set, the proposed method achieves a mean Average Precision (mAP) of 72.4% and a nuScenes Detection Score (NDS) of 73.9%, which are 2.2 and 1.0 percentage points higher than those of the baseline BEVFusion_mit, respectively.

Key words: 3D object detection, cross-modal fusion, Mamba, Bird's-Eye View (BEV), autonomous driving
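
The abstract describes the shallow-stage feature channel exchange only at a high level. The following is a minimal sketch of the general idea, assuming that the LiDAR and camera BEV feature maps have the same number of channels and that every `stride`-th channel is swapped between them. The selection rule, the function name `exchange_channels`, and the `stride` parameter are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Tuple

Channel = List[List[float]]   # one H x W BEV feature channel
FeatureMap = List[Channel]    # C channels of identical spatial size


def exchange_channels(
    lidar_bev: FeatureMap, camera_bev: FeatureMap, stride: int = 2
) -> Tuple[FeatureMap, FeatureMap]:
    """Swap every `stride`-th channel between two BEV feature maps.

    Assumption: a fixed, index-based selection rule. The source does not
    state which channels are exchanged or how they are chosen.
    """
    if len(lidar_bev) != len(camera_bev):
        raise ValueError("both modalities must have the same number of channels")
    if stride < 1:
        raise ValueError("stride must be a positive integer")

    lidar_out: FeatureMap = []
    camera_out: FeatureMap = []
    for index, (lidar_ch, camera_ch) in enumerate(zip(lidar_bev, camera_bev)):
        if index % stride == 0:
            # Exchanged channel: each branch receives the other modality's map.
            lidar_out.append(camera_ch)
            camera_out.append(lidar_ch)
        else:
            # Untouched channel: each branch keeps its own map.
            lidar_out.append(lidar_ch)
            camera_out.append(camera_ch)
    return lidar_out, camera_out


# Toy input: 4 channels of size 1 x 2, with distinct values per modality.
lidar = [[[float(c), float(c)]] for c in range(4)]                  # values 0..3
camera = [[[float(c) + 10.0, float(c) + 10.0]] for c in range(4)]   # values 10..13
mixed_lidar, mixed_camera = exchange_channels(lidar, camera, stride=2)
```

With `stride=2`, channels 0 and 2 of each branch come from the other modality while channels 1 and 3 are unchanged, so each branch sees part of the other sensor's features before any deeper fusion. Applying the same exchange a second time restores the original maps.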

CLC Number: TP391.4

Cite this article:

Mingguang LI, Chongben TAO. Hierarchical cross-modal fusion method for 3D object detection based on Mamba model[J]. Journal of Computer Applications, 2026, 46(2): 572-579.

References (26)

[1]	SUN X, FENG R F, CHEN Y R. Monocular 3D object detection method integrating depth and instance segmentation[J]. Journal of Computer Applications, 2024, 44(7): 2208-2215.
[2]	ZHOU J, HU Y Y, HU C Y, et al. Weakly perceived object detection method based on point cloud completion and multi-resolution Transformer[J]. Journal of Computer Applications, 2023, 43(7): 2155-2165.
[3]	LIANG M, YANG B, WANG S, et al. Deep continuous fusion for multi-sensor 3D object detection[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11220. Cham: Springer, 2018: 663-678.
[4]	LI X Z, WANG W, XUE B. Multi-modal fusion object detection based on gradient operator and attention[J]. Chinese Journal of Scientific Instrument, 2024, 45(11): 224-232.
[5]	LIU Z, TANG H, AMINI A, et al. BEVFusion: multi-task multi-sensor fusion with unified bird's-eye view representation[C]// Proceedings of the 2023 IEEE International Conference on Robotics and Automation. Piscataway: IEEE, 2023: 2774-2781.
[6]	LIANG T, XIE H, YU K, et al. BEVFusion: a simple and robust LiDAR-camera fusion framework[C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 10421-10434.
[7]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
[8]	WEI M, LI J, KANG H, et al. BEV-CFKT: a LiDAR-camera cross-modality-interaction fusion and knowledge transfer framework with transformer for BEV 3D object detection[J]. Neurocomputing, 2024, 582: No.127527.
[9]	GU A, GOEL K, RÉ C. Efficiently modeling long sequences with structured state spaces[EB/OL]. [2025-03-13].
[10]	LIU Y, TIAN Y J, ZHAO Y Z, et al. VMamba: visual state space model[EB/OL]. [2025-03-13].
[11]	ZHOU M H, LI T Y, QIAO C F, et al. DMM: disparity-guided multispectral mamba for oriented object detection in remote sensing[EB/OL]. [2025-03-13].
[12]	CAESAR H, BANKITI V, LANG A H, et al. nuScenes: a multimodal dataset for autonomous driving[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 11618-11628.
[13]	SUN P, KRETZSCHMAR H, DOTIWALLA X, et al. Scalability in perception for autonomous driving: Waymo Open Dataset[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 2443-2451.
[14]	OpenMMLab. MMDetection 3D: OpenMMLab next-generation platform for general 3D object detection[EB/OL]. [2024-12-02].
[15]	LIU Z, LIN Y T, CAO Y, et al. Swin Transformer: hierarchical Vision Transformer using shifted windows[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 9992-10002.
[16]	YAN Y, MAO Y, LI B. SECOND: sparsely embedded convolutional detection[J]. Sensors, 2018, 18(10): No.3337.
[17]	KINGMA D P, BA J L. Adam: a method for stochastic optimization[EB/OL]. [2025-03-13].
[18]	LANG A H, VORA S, CAESAR H, et al. PointPillars: fast encoders for object detection from point clouds[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 12689-12697.
[19]	YIN T, ZHOU X, KRÄHENBÜHL P. Center-based 3D object detection and tracking[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 11779-11788.
[20]	BAI X, HU Z, ZHU X, et al. TransFusion: robust LiDAR-camera fusion for 3D object detection with transformers[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 1080-1089.
[21]	YIN T, ZHOU X, KRÄHENBÜHL P. Multimodal virtual point 3D detection[C]// Proceedings of the 35th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2021: 16494-16507.
[22]	WANG C, MA C, ZHU M, et al. PointAugmenting: cross-modal augmentation for 3D object detection[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 11789-11798.
[23]	CAI Q, PAN Y, YAO T, et al. ObjectFusion: multi-modal 3D object detection with object-centric fusion[C]// Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 18021-18030.
[24]	YOO J H, KIM Y, KIM J, et al. 3D-CVF: generating joint camera and lidar features using cross-view spatial feature fusion for 3D object detection[C]// Proceedings of the 2020 European Conference on Computer Vision, LNCS 12372. Cham: Springer, 2020: 720-736.
[25]	SHI S, GUO C, JIANG L, et al. PV-RCNN: point-voxel feature set abstraction for 3D object detection[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10526-10535.
[26]	LI X, MA T, HOU Y, et al. LoGoNet: towards accurate 3D object detection with local-to-global cross-modal fusion[C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 17524-17534.

Method	Data	mAP	NDS	Per-class AP
				Car	Truck	Construction vehicle	Bus	Trailer	Barrier	Motorcycle	Bicycle	Pedestrian	Traffic cone
PointPillars [18]	L	30.5	45.3	68.4	23.0	4.1	28.2	23.4	38.9	27.4	1.1	59.7	30.8
CenterPoint [19]	L	60.3	67.3	85.2	53.5	20.0	63.6	56.0	71.1	59.5	30.7	84.6	78.4
TransFusion-L [20]	L	65.5	70.2	86.2	56.7	28.2	66.3	58.8	78.2	68.3	44.2	86.1	82.0
MVP [21]	LC	66.4	70.5	86.8	58.5	26.1	67.4	57.3	74.8	70.0	49.3	89.1	85.0
PointAugmenting [22]	LC	66.8	71.0	87.5	57.3	28.0	65.2	60.7	72.6	74.3	50.9	87.9	83.6
TransFusion [20]	LC	68.9	71.7	87.1	60.0	33.1	68.3	60.8	78.1	73.6	52.9	88.4	86.7
BEVFusion_ali [6]	LC	69.8	71.9	88.1	60.9	34.4	68.5	62.1	78.2	71.8	52.2	89.2	85.5
BEVFusion_mit [5]	LC	70.2	72.9	88.6	60.1	39.3	69.8	63.8	80.0	74.1	51.0	89.2	86.5
ObjectFusion [23]	LC	71.0	73.3	89.4	59.0	40.5	71.8	63.1	76.6	78.1	53.2	90.7	87.7
Proposed method	LC	72.4	73.9	89.9	65.7	37.6	73.3	65.4	82.2	76.5	53.6	91.4	88.7
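
The mAP and NDS in the last row of the table above (the proposed method) match the nuScenes test-set figures quoted in the abstract. As a small consistency check, the sketch below recomputes mAP as the mean of the ten per-class AP values and derives the gains over BEVFusion_mit. All numbers are copied from the table; the names `CLASS_AP`, `REPORTED`, and `mean_ap` are illustrative.

```python
from typing import Dict, List, Tuple

# Per-class AP (%) in table column order: car, truck, construction vehicle,
# bus, trailer, barrier, motorcycle, bicycle, pedestrian, traffic cone.
CLASS_AP: Dict[str, List[float]] = {
    "BEVFusion_mit": [88.6, 60.1, 39.3, 69.8, 63.8, 80.0, 74.1, 51.0, 89.2, 86.5],
    "Proposed method": [89.9, 65.7, 37.6, 73.3, 65.4, 82.2, 76.5, 53.6, 91.4, 88.7],
}

# Reported (mAP, NDS) in percent, copied from the same rows.
REPORTED: Dict[str, Tuple[float, float]] = {
    "BEVFusion_mit": (70.2, 72.9),
    "Proposed method": (72.4, 73.9),
}


def mean_ap(per_class_ap: List[float]) -> float:
    """Return the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


# Recomputed mAP per method, rounded to one decimal like the table.
recomputed = {name: round(mean_ap(aps), 1) for name, aps in CLASS_AP.items()}

# Gains of the proposed method over the baseline, in percentage points.
map_gain = round(REPORTED["Proposed method"][0] - REPORTED["BEVFusion_mit"][0], 1)
nds_gain = round(REPORTED["Proposed method"][1] - REPORTED["BEVFusion_mit"][1], 1)
```

The recomputed means agree with the reported mAP for both rows, and the gains of 2.2 (mAP) and 1.0 (NDS) percentage points agree with the abstract. NDS cannot be rechecked this way because it also depends on true-positive error terms that the table does not list.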

Method	Data	mAP	NDS	Per-class AP
				Car	Truck	Construction vehicle	Bus	Trailer	Barrier	Motorcycle	Bicycle	Pedestrian	Traffic cone
3D-CVF [24]	LC	52.7	62.3	83.0	45.0	15.9	48.8	49.6	65.9	51.2	30.4	74.2	65.9
TransFusion [20]	LC	67.3	71.2	87.6	62.0	27.4	75.7	42.8	73.9	75.4	63.1	87.8	77.0
BEVFusion_ali [6]	LC	67.9	71.0	88.6	65.0	28.1	75.4	41.4	72.2	76.7	65.8	88.7	76.9
BEVFusion_mit [5]	LC	68.3	71.1	88.5	65.1	28.7	75.2	41.9	73.1	76.2	66.8	88.9	77.2
ObjectFusion [23]	LC	69.9	72.3	89.6	65.2	32.1	77.5	43.6	75.8	79.4	65.1	89.4	81.3
Proposed method	LC	71.3	72.5	89.8	64.2	32.7	76.2	49.6	78.6	78.9	68.4	89.7	83.8

Method	Data	mAPH	Per-class APH
			Vehicle	Pedestrian	Cyclist
PointPillars [18]	L	57.6	62.5	50.2	59.9
PV-RCNN [25]	L	63.3	68.4	57.6	64.0
CenterPoint [19]	L	67.6	68.4	65.8	68.5
TransFusion [20]	LC	65.5	65.1	63.7	65.9
PointAugmenting [22]	LC	66.7	62.2	64.6	73.3
LoGoNet [26]	LC	71.3	70.5	69.7	73.6
Proposed method	LC	71.8	70.9	71.8	72.4

Method	Data	MACs/10⁹	Latency/ms
CenterPoint [19]	L	151.6	44.8
TransFusion [20]	LC	483.4	86.5
MVP [21]	LC	369.2	103.9
BEVFusion_mit [5]	LC	251.7	66.2
Proposed method	LC	289.5	74.8
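
To put the cost figures above in relative terms, the sketch below computes the extra computation and latency of the proposed method with respect to BEVFusion_mit, together with the throughput implied by each latency. The values are copied from the table. The names `COST`, `relative_increase`, and `frames_per_second` are illustrative, and the throughput conversion assumes that the latency is the end-to-end time for one frame, which the table does not state.

```python
from typing import Dict, Tuple

# (MACs in units of 10^9, latency in ms), copied from the table.
COST: Dict[str, Tuple[float, float]] = {
    "BEVFusion_mit": (251.7, 66.2),
    "Proposed method": (289.5, 74.8),
}


def relative_increase(new: float, base: float) -> float:
    """Return the increase of `new` over `base` as a percentage of `base`."""
    return (new - base) / base * 100.0


def frames_per_second(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


base_macs, base_latency = COST["BEVFusion_mit"]
ours_macs, ours_latency = COST["Proposed method"]

macs_overhead = round(relative_increase(ours_macs, base_macs), 1)
latency_overhead = round(relative_increase(ours_latency, base_latency), 1)
base_fps = round(frames_per_second(base_latency), 1)
ours_fps = round(frames_per_second(ours_latency), 1)
```

Under these assumptions the proposed method needs about 15% more multiply-accumulate operations and about 13% more time per frame than the baseline, while remaining cheaper than TransFusion and MVP on both measures.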

Baseline	SBSF	DBSF	CAM	mAP/%	NDS/%
√				68.3	71.1
√	√			69.5	71.7
√		√		68.9	71.4
√	√	√		70.6	72.2
√			√	68.5	71.1
√	√	√	√	71.3	72.5
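
The ablation rows above are easier to compare as gains over the baseline. The sketch below tabulates those gains in percentage points. The values are copied from the table, the module names SBSF, DBSF, and CAM are used exactly as they appear in its header, and `ABLATION` and `gain_over_baseline` are illustrative names.

```python
from typing import Dict, Tuple

Config = Tuple[bool, bool, bool]  # (SBSF, DBSF, CAM) enabled flags

# Configuration -> (mAP, NDS) in percent, copied from the ablation table.
ABLATION: Dict[Config, Tuple[float, float]] = {
    (False, False, False): (68.3, 71.1),  # baseline only
    (True, False, False): (69.5, 71.7),
    (False, True, False): (68.9, 71.4),
    (True, True, False): (70.6, 72.2),
    (False, False, True): (68.5, 71.1),
    (True, True, True): (71.3, 72.5),
}


def gain_over_baseline(config: Config) -> Tuple[float, float]:
    """Return the (mAP, NDS) gain of `config` over the baseline row."""
    base_map, base_nds = ABLATION[(False, False, False)]
    cur_map, cur_nds = ABLATION[config]
    return round(cur_map - base_map, 1), round(cur_nds - base_nds, 1)


gains = {config: gain_over_baseline(config) for config in ABLATION}
```

Read this way, SBSF alone adds 1.2 mAP and DBSF alone adds 0.6, while enabling both adds 2.3, which is more than the sum of the two individual gains. CAM alone adds 0.2 mAP with no change in NDS, and the full configuration adds 3.0 mAP and 1.4 NDS.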

SBSF	DBSF	Layers	mAP/%	NDS/%
√		1	68.7	71.3
√		2	69.5	71.7
√		3	69.2	71.4
√		4	68.6	71.2
√	√	1	70.4	71.9
√	√	2	71.3	72.5
√	√	3	70.9	72.3
√	√	4	69.7	71.7

Method	AP
	Car	Bus	Pedestrian	Truck
Base	73.2	66.4	78.8	68.5
Base+S	82.9	69.3	83.5	73.9
Base+S+D	90.4	78.8	91.8	74.2
Base+S+D (CAM)	91.1	81.4	92.7	78.6
