基于多模态信息相互引导补充的雷达-相机三维目标检测

doi:10.11772/j.issn.1001-9081.2024030290

《计算机应用》唯一官方网站 ›› 2025, Vol. 45 ›› Issue (3): 946-952.DOI: 10.11772/j.issn.1001-9081.2024030290

• 多媒体计算与计算机仿真 • 上一篇下一篇

基于多模态信息相互引导补充的雷达-相机三维目标检测

张传浩¹(), 屠晓涵¹, 谷学汇¹, 轩波²

^1.郑州警察学院图像与网络侦查系，郑州 450000
^2.郑州图铭智能科技有限公司，郑州 450000

收稿日期:2024-03-20 修回日期:2024-06-03 接受日期:2024-06-04 发布日期:2024-07-19 出版日期:2025-03-10
通讯作者: 张传浩
作者简介:屠晓涵（1991—），女，河南南阳人，讲师，博士，CCF会员，主要研究方向：计算机视觉、机器学习
谷学汇（1984—），男，吉林长春人，副教授，博士，CCF会员，主要研究方向：视频图像处理与检验
轩波（1978—），男，北京人，副研究员，博士，主要研究方向：模式识别、图形图像处理。
基金资助:
国家重点研发计划项目;河南省自然科学基金资助项目(242300420693);河南省科技攻关项目(212102210531);中央高校基本科研业务费项目(2022TJJBKY002)

LiDAR-camera 3D object detection based on multi-modal information mutual guidance and supplementation

Chuanhao ZHANG¹(), Xiaohan TU¹, Xuehui GU¹, Bo XUAN²

^1.Department of Image and Network Investigation，Zhengzhou Police University，Zhengzhou Henan 450000，China
^2.Zhengzhou Tuming Intelligent Technology Company Limited，Zhengzhou Henan 450000，China

Received:2024-03-20 Revised:2024-06-03 Accepted:2024-06-04 Online:2024-07-19 Published:2025-03-10
Contact: Chuanhao ZHANG
About author:TU Xiaohan， born in 1991， Ph. D.， lecturer. Her research interests include computer vision， machine learning.
GU Xuehui， born in 1984， Ph. D.， associate professor. His research interests include video image processing and verification.
XUAN Bo， born in 1978， Ph. D.， associate research fellow. His research interests include pattern recognition， graphics and image processing.
Supported by:
National Key Research and Development Program of China;Henan Provincial Natural Science Foundation(242300420693);Henan Province Science and Technology Research Project(212102210531);Fundamental Research Funds for Central University(2022TJJBKY002)

摘要/Abstract

摘要：

多模态三维目标检测是计算机视觉的一项重要任务，如何更好地融合不同模态之间的信息一直是该任务的研究重点。现有方法在融合不同模态信息时缺少对信息的筛选，且过多无关与干扰信息会造成模型性能的下降。针对上述问题，提出一种基于多模态信息相互引导补充的雷达-相机三维目标检测模型，以在融合特征时从另一种模态中自适应地挑选信息进行融合。自适应信息融合包括数据层面的相互引导补充和特征层面的相互引导补充。在数据层面的融合中，使用由点云产生的深度图和图像产生的分割掩码作为输入，以分别构建出实例级的深度图与实例级的三维虚拟点用于图像与点云的补充。在特征层面的融合中，使用点云产生的体素特征和图像产生的特征图作为输入，并从另一种模态中为待融合特征选取关键区域并通过注意力机制进行特征融合。实验结果表明，所提模型在nuScenes测试集上取得了良好的效果。相较于BEVFusion和TransFusion等传统非引导的融合模型，所提模型将平均精度均值（mAP）和nuScenes检测分数（NDS）这2个主流评测指标分别提升了0.9~28.9个百分点和0.6~26.1个百分点。以上验证了所提模型可有效提高多模态三维目标检测的准确性。

关键词: 多模态, 三维目标检测, 自适应信息融合, 数据层面融合, 特征层面融合, 雷达-相机

Abstract:

Multi-modal 3D object detection is an important task in computer vision， and how to better fuse information among different modalities is always a research focus of this task. Previous methods lack information filtering when fusing the information of different modalities， and excessive irrelevant and interference information may lead to a decline in model performance. To address the above issues， an LiDAR-camera 3D object detection model based on multi-modal information mutual guidance and supplementation was proposed， which selected information from another modality for fusion adaptively when fusing features. Adaptive information fusion includes data-level and feature-level mutual guidance and supplementation. In data-level fusion， depth maps generated by point clouds and segmentation masks generated by images were used as input to construct instance-level depth maps and instance-level 3D virtual points， respectively， for supplementing images and point clouds. In feature-level fusion， voxel features generated by point clouds and feature maps generated by images were used as input， and key regions were selected from another modality for the features to be fused and feature fusion was conducted through attention mechanism. Experimental results show that the proposed model achieves good results on nuScenes test set. Compared to traditional unguided fusion models such as BEVFusion and TransFusion， the proposed model has the two mainstream evaluation indexes — mean Average Precision （mAP） and nuScenes Detection Score （NDS） improved by 0.9-28.9 percentage points and 0.6-26.1 percentage points， respectively. The above verifies that the proposed model can improve the accuracy of multi-modal 3D object detection effectively.

Key words: multi-modal, 3D object detection, adaptive information fusion, data-level fusion, feature-level fusion, LiDAR-camera

中图分类号:

TP751.1

张传浩, 屠晓涵, 谷学汇, 轩波. 基于多模态信息相互引导补充的雷达-相机三维目标检测[J]. 计算机应用, 2025, 45(3): 946-952.

Chuanhao ZHANG, Xiaohan TU, Xuehui GU, Bo XUAN. LiDAR-camera 3D object detection based on multi-modal information mutual guidance and supplementation[J]. Journal of Computer Applications, 2025, 45(3): 946-952.

图/表 8

参考文献 39

1	QI C R， LIU W， WU C， et al. Frustum PointNets for 3D object detection from RGB-D data ［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 918-927.
2	SHIN K， KWON Y， TOMIZUKA M. RoarNet： a robust 3D object detection based on region approximation refinement ［C］// Proceedings of the 2019 IEEE Intelligent Vehicles Symposium. Piscataway： IEEE， 2019： 2510-2515.
3	ZHANG Y， CHEN J， HUANG D. CAT-Det： contrastively augmented Transformer for multi-modal 3D object detection ［C］// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2022： 898-907.
4	HUANG J， HUANG G， ZHU Z， et al. BEVDet： high-performance multi-camera 3D object detection in bird-eye-view ［EB/OL］. ［2024-02-14］. .
5	LIU Z， TANG H， AMINI A， et al. BEVFusion： multi-task multi-sensor fusion with unified bird’s-eye view representation ［C］// Proceedings of the IEEE 2023 International Conference on Robotics and Automation. Piscataway： IEEE， 2023： 2774-2781.
6	PANG S， MORRIS D， RADHA H. CLOCs： camera-LiDAR object candidates fusion for 3D object detection ［C］// Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway： IEEE， 2020： 10386-10393.
7	MAO J， WANG X， LI H. Interpolated convolutional networks for 3D point cloud understanding ［C］// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2019： 1578-1587.
8	QI C R， SU H， MO K， et al. PointNet： deep learning on point sets for 3D classification and segmentation ［C］// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2017： 77-85.
9	QI C R， YI L， SU H， et al. PointNet++ deep hierarchical feature learning on point sets in a metric space ［C］// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2017： 5105-5114.
10	WANG Y， SUN Y， LIU Z， et al. Dynamic graph CNN for learning on point clouds ［J］. ACM Transactions on Graphics， 2019， 38（5）： No.146.
11	ZHOU Y， TUZEL O. VoxelNet： end-to-end learning for point cloud based 3D object detection ［C］// Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 4490-4499.
12	LANG A H， VORA S， CAESAR H， et al. PointPillars： fast encoders for object detection from point clouds ［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 12689-12697.
13	ZHOU Y， SUN P， ZHANG Y， et al. End-to-end multi-view fusion for 3D object detection in LiDAR point clouds ［C］// Proceedings of the 2020 Conference on Robot Learning. New York： JMLR.org， 2020： 923-932.
14	MEYER G P， LADDHA A， KEE E， et al. LaserNet： an efficient probabilistic 3D object detector for autonomous driving ［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 12669-12678.
15	YIN T， ZHOU X， KRÄHENBÜHL P. Center-based 3D object detection and tracking ［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 11779-11788.
16	YANG B， LUO W， URTASUN R. PIXOR： real-time 3D object detection from point clouds ［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 7652-7660.
17	BRAZIL G， LIU X. M3D-RPN： monocular 3D region proposal network for object detection ［C］// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2019： 9286-9295.
18	MANHARDT F， KEHL W， GAIDON A. ROI-10D： monocular lifting of 2D detection to 6D pose and metric shape ［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 2064-2073.
19	BRAZIL G， PONS-MOLL G， LIU X， et al. Kinematic 3D object detection in monocular video ［C］// Proceedings of the 2020 European Conference on Computer Vision， LNCS 12368. Cham： Springer， 2020： 135-152.
20	MOUSAVIAN A， ANGUELOV D， FLYNN J， et al. 3D bounding box estimation using deep learning and geometry ［C］// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2017： 5632-5640.
21	FU H， GONG M， WANG C， et al. Deep ordinal regression network for monocular depth estimation ［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 2002-2011.
22	ZIA M Z， STARK M， SCHINDLER K. Are cars just 3D boxes？ — jointly estimating the 3D shape of multiple objects ［C］// Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2014： 3678-3685.
23	PHILION J， FIDLER S. Lift， splat， shoot： encoding images from arbitrary camera rigs by implicitly unprojecting to 3D ［C］// Proceedings of the 2020 European Conference on Computer Vision， LNCS 12359. Cham： Springer， 2020：194-210.
24	HUANG J， HUANG G. BEVDet 4D： exploit temporal cues in multi-camera 3D object detection ［EB/OL］. ［2023-12-02］. .
25	PANG S， MORRIS D， RADHA H. Fast-CLOCs： fast camera-LiDAR object candidates fusion for 3D object detection ［C］// Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision. Piscataway： IEEE， 2022： 3747-3756.
26	YANG B， GUO R， LIANG M， et al. RadarNet： exploiting radar for robust perception of dynamic objects ［C］// Proceedings of the 2020 European Conference on Computer Vision， LNCS 12363. Cham： Springer， 2020： 496-512.
27	QIAN K， ZHU S， ZHANG X， et al. Robust multimodal vehicle detection in foggy weather using complementary lidar and radar signals ［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 444-453.
28	YANG B， LIANG M， URTASUN R. HDNET： exploiting HD maps for 3D object detection ［C］// Proceedings of the 2nd Conference on Robot Learning. New York： JMLR.org， 2018： 146-155.
29	FANG J， ZHOU D， SONG X， et al. MapFusion： a general framework for 3D object detection with HDMaps ［C］// Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway： IEEE， 2021： 3406-3413.
30	LIU Z， LIN Y， CAO Y， et al. Swin Transformer： hierarchical vision Transformer using shifted windows ［C］// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2021： 9992-10002.
31	XIE E， YU Z， ZHOU D， et al. M² BEV： multi-camera joint 3D detection and segmentation with unified birds-eye view representation［EB/OL］. ［2024-01-25］. .
32	LI Z， WANG W， LI H， et al. BEVFormer： learning bird’s-eye-view representation from multi-camera images via spatiotemporal Transformers ［C］// Proceedings of the 2022 European Conference on Computer Vision， LNCS 13669. Cham： Springer， 2022： 1-18.
33	YAN Y， MAO Y， LI B. SECOND： sparsely embedded convolutional detection ［J］. Sensors， 2018， 18（10）： No.3337.
34	VORA S， LANG A H， HELOU B， et al. PointPainting： sequential fusion for 3D object detection ［C］// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2020： 4603-4611.
35	YIN T， ZHOU X， KRÄHENBÜHL P. Multimodal virtual point 3D detection ［C］// Proceedings of the 35th International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2021： 16494-16507.
36	XU S， ZHOU D， FANG J， et al. FusionPainting： multimodal fusion with adaptive attention for 3D object detection ［C］// Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference. Piscataway： IEEE， 2021： 3047-3054.
37	CHEN Z， LI Z， ZHANG S， et al. AutoAlign： pixel-instance feature aggregation for multi-modal 3D object detection ［C］// Proceedings of the 31st International Joint Conference on Artificial Intelligence. California： IJCAI.org， 2022：827-833.
38	CHEN X， ZHANG T， WANG Y， et al. FUTR3D： a unified sensor fusion framework for 3D detection ［C］// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2023： 172-181.
39	BAI X， HU Z， ZHU X， et al. Transfusion： robust LiDAR-camera fusion for 3D object detection with transformers ［C］// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2022： 1080-1089.

模态	模型	测试集		验证集
模态	模型	mAP	NDS	mAP	NDS
相机	BEVDet	42.2	48.2	—	—
	M²BEV	42.9	47.4	41.7	47.0
	BEVFormer	44.5	53.5	41.6	51.7
	BEVDet4D	45.1	56.9	—	—
雷达	PointPillars	—	—	52.3	61.3
	SECOND	52.8	63.3	52.6	63.0
	CenterPoint	60.3	67.3	59.6	66.8
雷达+相机	PointPainting	—	—	65.8	69.6
	MVP	66.4	70.5	66.1	70.0
	FusionPainting	68.1	71.6	66.5	70.7
	AutoAlign	—	—	66.6	71.1
	FUTR3D	—	—	64.5	68.3
	TransFusion	68.9	71.6	67.5	71.3
	BEVFusion	70.2	72.9	68.5	71.4
	本文模型	71.1	73.5	70.0	72.3

模态	模型	测试集		验证集
模态	模型	mAP	NDS	mAP	NDS
相机	BEVDet	42.2	48.2	—	—
	M²BEV	42.9	47.4	41.7	47.0
	BEVFormer	44.5	53.5	41.6	51.7
	BEVDet4D	45.1	56.9	—	—
雷达	PointPillars	—	—	52.3	61.3
	SECOND	52.8	63.3	52.6	63.0
	CenterPoint	60.3	67.3	59.6	66.8
雷达+相机	PointPainting	—	—	65.8	69.6
	MVP	66.4	70.5	66.1	70.0
	FusionPainting	68.1	71.6	66.5	70.7
	AutoAlign	—	—	66.6	71.1
	FUTR3D	—	—	64.5	68.3
	TransFusion	68.9	71.6	67.5	71.3
	BEVFusion	70.2	72.9	68.5	71.4
	本文模型	71.1	73.5	70.0	72.3

模型	模态	晴天		雨天		白天		夜晚
模型	模态	mAP	mIoU	mAP	mIoU	mAP	mIoU	mAP	mIoU
CenterPoint	雷达	62.9	50.7	59.2	42.3	62.8	48.9	35.4	37.0
BEVDet/LSS	相机	32.9	59.0	33.7	50.5	33.7	57.4	13.5	30.8
BEVFusion	雷达+相机	68.2	65.6	69.9	55.9	68.5	63.1	42.8	43.6
本文模型	雷达+相机	69.0	66.5	71.6	57.2	69.2	63.8	43.9	45.3

模型	模态	晴天		雨天		白天		夜晚
模型	模态	mAP	mIoU	mAP	mIoU	mAP	mIoU	mAP	mIoU
CenterPoint	雷达	62.9	50.7	59.2	42.3	62.8	48.9	35.4	37.0
BEVDet/LSS	相机	32.9	59.0	33.7	50.5	33.7	57.4	13.5	30.8
BEVFusion	雷达+相机	68.2	65.6	69.9	55.9	68.5	63.1	42.8	43.6
本文模型	雷达+相机	69.0	66.5	71.6	57.2	69.2	63.8	43.9	45.3

模态	mAP	NDS
雷达	62.7	69.8
相机	36.6	43.4
雷达+相机	70.0	72.3

基于多模态信息相互引导补充的雷达-相机三维目标检测

LiDAR-camera 3D object detection based on multi-modal information mutual guidance and supplementation

RichHTML

PDF

可视化

摘要/Abstract

引用本文

使用本文

图/表 8

参考文献 39

相关文章 15

编辑推荐

Metrics

模型	mAP	NDS
基线	68.5	71.4
基线+多模态引导（数据）	68.9	71.7
基线+多模态引导（特征）	69.6	72.1
基线+多模态引导（数据+特征）	70.0	72.3

[1]	孙晨伟, 侯俊利, 刘祥根, 吕建成. 面向工程图纸理解的大语言模型提示生成方法[J]. 《计算机应用》唯一官方网站, 2025, 45(3): 801-807.
[2]	陈维, 施昌勇, 马传香. 基于多模态数据融合的农作物病害识别方法[J]. 《计算机应用》唯一官方网站, 2025, 45(3): 840-848.
[3]	蔡启健, 谭伟. 语义图增强的多模态推荐算法[J]. 《计算机应用》唯一官方网站, 2025, 45(2): 421-427.
[4]	黄颖, 杨佳宇, 金家昊, 万邦睿. 用于RGBT跟踪的孪生混合信息融合算法[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2878-2885.
[5]	张睿, 张鹏云, 高美蓉. 自优化双模态多通路非深度前庭神经鞘瘤识别模型[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2975-2982.
[6]	薛凯鹏, 徐涛, 廖春节. 融合自监督和多层交叉注意力的多模态情感分析网络[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2387-2392.
[7]	刘越, 刘芳, 武奥运, 柴秋月, 王天笑. 基于自注意力机制与图卷积的3D目标检测网络[J]. 《计算机应用》唯一官方网站, 2024, 44(6): 1972-1977.
[8]	陈田, 蔡从虎, 袁晓辉, 罗蓓蓓. 基于多尺度卷积和自注意力特征融合的多模态情感识别方法[J]. 《计算机应用》唯一官方网站, 2024, 44(2): 369-376.
[9]	赖华, 孙童, 王文君, 余正涛, 高盛祥, 董凌. 多模态特征的越南语语音识别文本标点恢复[J]. 《计算机应用》唯一官方网站, 2024, 44(2): 418-423.
[10]	郑盛有, 陈雁翔, 赵祖兴, 刘海洋. 多模态部分伪造数据集的构建与基准检测[J]. 《计算机应用》唯一官方网站, 2024, 44(10): 3134-3140.
[11]	林于翔, 吴运兵, 阴爱英, 廖祥文. 基于语义相关性分析的多模态摘要模型[J]. 《计算机应用》唯一官方网站, 2024, 44(1): 65-72.
[12]	赵强, 王中卿, 王红玲. 融合多模态信息的产品摘要抽取模型[J]. 《计算机应用》唯一官方网站, 2024, 44(1): 73-78.
[13]	黄懿蕊, 罗俊玮, 陈景强. 基于对比学习和GIF标记的多模态对话回复检索[J]. 《计算机应用》唯一官方网站, 2024, 44(1): 32-38.
[14]	王春雷, 王肖, 刘凯. 多模态知识图谱表示学习综述[J]. 《计算机应用》唯一官方网站, 2024, 44(1): 1-15.
[15]	罗俊豪, 朱焱. 用于未对齐多模态语言序列情感分析的多交互感知网络[J]. 《计算机应用》唯一官方网站, 2024, 44(1): 79-85.