Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (7): 2208-2215.DOI: 10.11772/j.issn.1001-9081.2023070990

• Multimedia computing and computer simulation •

Monocular 3D object detection method integrating depth and instance segmentation

Xun SUN1, Ruifeng FENG2, Yanru CHEN2

  1. Line Station Design and Research Institute, China Railway Siyuan Survey and Design Group Company Limited, Wuhan Hubei 430063, China
    2. College of Economics and Management, Southwest Jiaotong University, Chengdu Sichuan 610031, China
  • Received:2023-07-21 Revised:2023-09-26 Accepted:2023-09-28 Online:2023-10-26 Published:2024-07-10
  • Contact: Ruifeng FENG
  • About author: SUN Xun, born in 1972 in Wuhan, Hubei, senior engineer. Her research interests include station intelligence, logistics planning, and station design.
    FENG Ruifeng, born in 1999 in Suining, Sichuan, M. S. candidate. His research interests include logistics system modeling and optimization, and machine learning.
    CHEN Yanru, born in 1974 in Baotou, Inner Mongolia, Ph. D., professor. Her research interests include logistics system modeling and optimization, and machine learning.
  • Supported by:
    National Natural Science Foundation of China(62173279)


Abstract:

To address the poor performance of monocular 3D object detection under perspective-induced changes in object size and under occlusion, a monocular 3D object detection method was proposed that fuses depth information with instance segmentation masks. Firstly, a Depth-Mask Attention Fusion (DMAF) module was used to combine depth information with instance segmentation masks, providing more accurate object boundaries. Secondly, dynamic convolution was introduced, and the fused features produced by the DMAF module were used to guide the generation of dynamic convolution kernels, so that objects of different scales could be handled. Moreover, a 2D-3D bounding box consistency loss was added to the loss function, adjusting the predicted 3D bounding box so that its projection coincides closely with the corresponding 2D detection box, thereby improving performance on both the instance segmentation and 3D object detection tasks. Lastly, the effectiveness of the proposed method was confirmed through ablation studies, and the method was evaluated on the KITTI test set. The results show that, compared with a method using only depth estimation maps and instance segmentation masks, the proposed method improves the average precision of car detection at moderate difficulty by 6.36 percentage points, and it outperforms comparison methods such as D4LCN (Depth-guided Dynamic-Depthwise-Dilated Local Convolutional Network) and M3D-RPN (Monocular 3D Region Proposal Network) on both 3D object detection and bird's-eye-view object detection tasks.
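The 2D-3D bounding box consistency idea mentioned in the abstract can be illustrated with a minimal sketch: project the eight corners of a predicted 3D box into the image with a pinhole camera model, take the axis-aligned enclosing rectangle, and penalize its disagreement with the detected 2D box via 1 − IoU. The function names, the (h, w, l) dimension ordering, the corner layout, and the choice of an IoU-based penalty are illustrative assumptions, not the paper's actual formulation.

```python
import math

def project_corners(center, dims, yaw, K):
    """Project the 8 corners of a 3D box (camera coordinates) onto the image plane.

    center: (cx, cy, cz) of the box bottom face; dims: (h, w, l); yaw: rotation
    about the vertical axis; K: 3x3 pinhole intrinsics [[fx,0,u0],[0,fy,v0],[0,0,1]].
    (All conventions here are illustrative assumptions.)
    """
    cx, cy, cz = center
    h, w, l = dims
    # Corner offsets in the object frame (y points down, so the top is at -h).
    xs = [ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2]
    ys = [ 0.0,  0.0,  0.0,  0.0,   -h,   -h,   -h,   -h]
    zs = [ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2]
    pts = []
    for x, y, z in zip(xs, ys, zs):
        # Rotate about the vertical axis, then translate to the box centre.
        xr =  math.cos(yaw) * x + math.sin(yaw) * z + cx
        zr = -math.sin(yaw) * x + math.cos(yaw) * z + cz
        yr = y + cy
        # Pinhole projection.
        u = K[0][0] * xr / zr + K[0][2]
        v = K[1][1] * yr / zr + K[1][2]
        pts.append((u, v))
    return pts

def consistency_loss(box2d, center, dims, yaw, K):
    """1 - IoU between a detected 2D box and the projection of the 3D box.

    box2d: (x1, y1, x2, y2). Returns 0 when the projected 3D box exactly
    matches the 2D detection, approaching 1 as the overlap vanishes.
    """
    pts = project_corners(center, dims, yaw, K)
    proj = (min(p[0] for p in pts), min(p[1] for p in pts),
            max(p[0] for p in pts), max(p[1] for p in pts))
    ix1, iy1 = max(box2d[0], proj[0]), max(box2d[1], proj[1])
    ix2, iy2 = min(box2d[2], proj[2]), min(box2d[3], proj[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box2d) + area(proj) - inter
    return 1.0 - inter / union
```

In training, a term like this would be minimized jointly with the detection losses, pulling the predicted 3D pose and dimensions toward values whose projection agrees with the 2D evidence.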

Key words: monocular 3D object detection, deep learning, dynamic convolution, instance segmentation
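The dynamic-convolution step described in the abstract — fused DMAF features guiding the generation of convolution kernels — can be sketched in miniature as follows: a guidance vector is mapped through a linear layer to nine softmax-normalized taps of a 3x3 kernel, which is then applied to a single-channel feature map. The guidance vector, the linear weights, the softmax normalization, and the single-channel setting are all illustrative assumptions, not the network's actual design.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def generate_kernel(guidance, weights):
    """Map a guidance vector (e.g. pooled DMAF features) to 9 kernel taps.

    weights: 9 rows, one per tap, each the length of the guidance vector.
    Softmax keeps the taps positive and summing to 1 (an assumed choice).
    """
    logits = [sum(w * g for w, g in zip(row, guidance)) for row in weights]
    return softmax(logits)

def dynamic_conv(feat, kernel):
    """Apply the generated 3x3 kernel to a 2D feature map with zero padding."""
    H, W = len(feat), len(feat[0])
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < H and 0 <= jj < W:
                        acc += kernel[(di + 1) * 3 + (dj + 1)] * feat[ii][jj]
            out[i][j] = acc
    return out
```

Because the kernel is a function of the input's own fused features rather than a fixed learned tensor, the effective receptive behaviour can adapt per image, which is the property the method relies on for handling objects of different scales.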

