Proposal-based aggregation network for single object tracking in 3D point cloud

doi:10.11772/j.issn.1001-9081.2021030533

Abstract

Abstract:

Compared with 2D RGB-based images， 3D point clouds retain the real and rich geometric information of objects in space to deal with vision challenge with scale variation in the single object tracking problem. However， the precision of 3D object tracking is affected by the loss of information brought by the sparsity of point cloud data and the deformation caused by the object position changing. To solve the above two problems， a proposal-based aggregation network composed of three modules was proposed in an end-to-end learning pattern. In this network， the 3D bounding box was determined by locating object center in the best proposal to realize the single object tracking in 3D point cloud. Firstly， the point cloud data of both templates and search areas was transferred into bird’s-eye view pseudo images. In the first module， the feature information was enriched through spatial and cross-channel attention mechanisms. Then， in the second module， the best proposal was given by the anchor-based deep cross-correlation Siamese region proposal subnetwork. Finally， in the third module， the object features were extracted through region of interest pooling operation by the best proposal at first， and then， the object and template features were aggregated， the sparse modulated deformable convolution layer was used to deal with the problems of point cloud sparsity and deformation， and the final 3D bounding box was determined. Experimental results of the comparison between the proposed method and the state-of-the-art 3D point cloud single object tracking methods on KITTI dataset show that： in comprehensive experiment of car， the proposed method has improved 1.7 percentage points on success rate and 0.2 percentage points on precision in real scenes； in multi-category extensive experiment of car， van， cyclist and pedestrian， the proposed method has improved the average success rate by 0.8 percentage points， and the average precision by 2.8 percentage points， indicating that the proposed method can solve the single object tracking problem in 3D point cloud and make the 3D object tracking results more accurate.

Key words: point cloud, object tracking, Siamese network, attention mechanism, deformable convolution

摘要：

与二维可见光图像相比，三维点云在空间中保留了物体真实丰富的几何信息，能够应对单目标跟踪问题中存在尺度变换的视觉挑战。针对三维目标跟踪精度受到点云数据稀疏性导致的信息缺失影响，以及物体位置变化带来的形变影响这两个问题，在端到端的学习模式下提出了由三个模块构成的提案聚合网络，通过在最佳提案内定位物体的中心来确定三维边界框从而实现三维点云中的单目标跟踪。首先，将模板和搜索区域的点云数据转换为鸟瞰伪图，模块一通过空间和跨通道注意力机制丰富特征信息；然后，模块二用基于锚框的深度互相关孪生区域提案子网给出最佳提案；最后，模块三先利用最佳提案对搜索区域的感兴趣区域池化操作来提取目标特征，随后聚合了目标与模板特征，利用稀疏调制可变形卷积层来解决点云稀疏以及形变的问题并确定了最终三维边界框。在KITTI跟踪数据集上把所提方法与最新的三维点云单目标跟踪方法进行比较的实验结果表明：在汽车类综合性实验中，真实场景中所提方法在成功率上提高了1.7个百分点，精确率上提高了0.2个百分点；在多类别扩展性实验上，即在汽车、货车、骑车人以及行人这4类上所提方法的平均成功率提高了0.8个百分点，平均精确率提高了2.8个百分点。可见，所提方法能够解决三维点云中的单目标跟踪问题，使得三维目标跟踪结果更加精确。

关键词: 点云, 目标跟踪, 孪生网络, 注意力机制, 可变形卷积

CLC Number:

TP399

Yi ZHUANG, Haitao ZHAO. Proposal-based aggregation network for single object tracking in 3D point cloud[J]. Journal of Computer Applications, 2022, 42(5): 1407-1416.

庄屹, 赵海涛. 面向三维点云单目标跟踪的提案聚合网络[J]. 《计算机应用》唯一官方网站, 2022, 42(5): 1407-1416.

Figures/Tables 18

Fig. 1 Overall structure of Proposal-based Aggregation Network （PA-Net）

Fig. 2 Rasterized feature extraction for point cloud

Fig. 3 Structure of spatial attention mechanism module

Fig. 4 Structure of cross-channel attention mechanism module

Fig. 5 Schematic diagram of convolution and deep cross-correlation

Tab. 1 Parameter setting of convolution modules

模块	参数
卷积块1	Conv2D（64，128，3，2，1）
	Conv2D（128，128，3，1，1）*3
	BatchNorm2D（128，128）
	ReLU
卷积块2	Conv2D（128，128，3，2，1）
	Conv2D（128，128，3，1，1）*5
	BatchNorm2D（128，128）
	ReLU
卷积块3	Conv2D（128，256，3，2，1）
	Conv2D（256，256，3，1，1）*5
	BatchNorm2D（256，256）
	ReLU
上采样块1	Deconv2D（128，256，1，1，0）
上采样块2	Deconv2D（128，256，2，2，0）
上采样块3	Deconv2D（256，256，4，4，0）
分类融合卷积块	Conv2D（256，2，1，1，0）（用于3个分辨率）
	Concatenate（拼接3个分辨率结果）
	Conv2D（6，2，1，1，0）
	Sigmoid
锚框偏移融合卷积块	Conv2D（256，4，1，1，0）（用于3个分辨率）
	Concatenate（拼接3个分辨率结果）
	Conv2D（12，4，1，1，0）

Fig. 6 Structure of sparse modulated deformable convolution

Fig. 7 Fusion result of merged sampling oftemplate point cloud（Car）

Fig. 8 Curves of loss in training and validation

Tab. 2 Comprehensive experimental results of different methods on Car

方法	成功率/%					精确率/%
方法	SC3D	2D-SC3D	P2B	3D-SiamRPN	PA-Net	SC3D	2D-SC3D	P2B	3D-SiamRPN	PA-Net
前一帧预测	41.3	36.2	56.2	57.3	59.0	57.9	51.0	72.8	75.0	75.2
前一帧GT	64.6	—	82.4	—	85.5	74.5	—	90.1	—	91.8
当前帧GT	76.9	—	84.0	—	89.4	81.3	—	90.3	—	93.2

Tab. 3 Extensive experimental results on different categories of different methods

类别	帧数	成功率/%					精确率/%
类别	帧数	SC3D	2D-SC3D	P2B	3D-SiamRPN	PA-Net	SC3D	2D-SC3D	P2B	3D-SiamRPN	PA-Net
均值		30.0	26.6	42.4	46.7	47.5	46.7	43.6	60.0	64.9	67.7
汽车	6 424	41.3	36.2	56.2	57.3	59.0	57.9	51.0	72.8	75.0	75.2
货车	1 248	40.4	—	40.8	45.7	51.2	47.0	—	48.4	52.8	62.8
骑车人	308	41.5	43.2	32.1	36.1	55.8	70.4	81.2	44.7	49.0	78.4
行人	6 088	18.2	17.9	28.7	35.2	38.4	37.8	47.8	49.6	56.2	66.2

Tab. 4 Ablation experimental results of PA-Net in feature enriching layer and aggregated regression layer on Car

特征丰富层	聚合回归层	成功率	精确率
无注意力机制	传统卷积	52.2	62.3
并行式注意力机制	传统卷积	54.3	64.0
分离式注意力机制	传统卷积	54.4	68.8
分离式注意力机制	调制可变性卷积	58.8	74.9
分离式注意力机制	稀疏调制可变性卷积	59.0	75.2

Fig. 9 Visual comparison of PA-Net and P2B tracking results on Car

Fig. 10 Tracking trajectories of PA-Net and P2B on Car

Fig. 11 Classification heat maps with different feature enriching layers

Fig. 12 Prediction of object bounding box and center with different output layers

Tab. 5 Results of predicted center position and deflection angle

回归对象		预测值
中心位置	提案前景最优置信度	0.987
中心位置	回归值/m	［2.500 5，11.901 0，0.752 1］
中心补偿	中心前景最优置信度	0.945
中心补偿	补偿值/m	［0.034 8，0.195 4，0.015 3］
预测中心/m		［2.535 3，12.096 4，0.767 4］
真实中心/m		［2.289 0，12.072 6，0.764 7］
中心偏差/m		［0.246 3，0.023 8，0.002 7］
预测偏转角度/rad		0.048 7
真实偏转角度/rad		0.058 6
角度偏差/rad		$-$ 0.009 9

Tab. 5 Results of predicted center position and deflection angle

回归对象		预测值
中心位置	提案前景最优置信度	0.987
中心位置	回归值/m	［2.500 5，11.901 0，0.752 1］
中心补偿	中心前景最优置信度	0.945
中心补偿	补偿值/m	［0.034 8，0.195 4，0.015 3］
预测中心/m		［2.535 3，12.096 4，0.767 4］
真实中心/m		［2.289 0，12.072 6，0.764 7］
中心偏差/m		［0.246 3，0.023 8，0.002 7］
预测偏转角度/rad		0.048 7
真实偏转角度/rad		0.058 6
角度偏差/rad		$-$ 0.009 9

Tab. 6 Running speeds of different methods on Car

方法	预处理/ms	模型推理/ms	后处理/ms	总时长/ms	帧率/（frame·s^-1）
P2B	7.0	14.3	0.9	22.2	45.0
3D-SiamRPN	0.5	40.7	7.2	48.0	20.8
PA-Net	35.0	5.6	0.3	40.9	24.4

References 28

1	SMEULDERS A W M， CHU D M， CUCCHIARA R， et al. Visual tracking： an experimental survey ［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2014， 36（7）： 1442-1468. 10.1109/tpami.2013.230
2	SHAO L， SHAH P， DWARACHERLA V， et al. Motion-based object segmentation based on dense RGB-D scene flow ［J］. IEEE Robotics and Automation Letters， 2018， 3（4）： 3797-3804. 10.1109/lra.2018.2856525
3	ZHOU Y， WANG T， HU R H， et al. Multiple Kernelized Correlation Filters （MKCF） for extended object tracking using X-band marine radar data ［J］. IEEE Transactions on Signal Processing， 2019， 67（14）： 3676-3688. 10.1109/tsp.2019.2917812
4	LI C L， ZHU C L， HUANG Y， et al. Cross-modal ranking with soft consistency and noisy labels for robust RGB-T tracking ［C］// Proceedings of the 2018 European Conference on Computer Vision， LNCS 11217. Cham： Springer， 2018： 831-847.
5	ZHU Y B， LI C L， TANG J， et al. Quality-aware feature aggregation network for robust RGBT tracking ［J］. IEEE Transactions on Intelligent Vehicles， 2021， 6（1）： 121-130. 10.1109/tiv.2020.2980735
6	王红艳，郑伶杰，陈献娜.简述激光雷达点云数据的处理应用［J］.资源导刊，2015（S2）：44-45. 10.3969/j.issn.1674-053X.2015.z2.022
	WANG H Y， ZHENG L J， CHEN X N. Brief introduction of the processing application of the point cloud data of lidar ［J］. Resources Guide， 2015（S2）： 44-45. 10.3969/j.issn.1674-053X.2015.z2.022
7	GIANCOLA S， ZARZAR J， GHANEM B. Leveraging shape completion for 3D Siamese tracking ［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 1359-1368. 10.1109/cvpr.2019.00145
8	QI H Z， FENG C， CAO Z G， et al. P2B： point-to-box network for 3D object tracking in point clouds ［C］// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2020： 6328-6337. 10.1109/CVPR42600.2020.00636
9	QI C H， YI L， SU H， et al. PointNet++： deep hierarchical feature learning on point sets in a metric space ［C］// Proceedings of the 2017 31st International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2017： 5105-5114.
10	QI C H， LITANY O， HE K M， et al. Deep Hough voting for 3D object detection in point clouds ［C］// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2019： 9276-9285. 10.1109/iccv.2019.00937
11	FANG Z， ZHOU S F， CUI Y B， et al. 3D-SiamRPN： an end-to-end learning method for real-time 3D single object tracking using raw point cloud ［J］. IEEE Sensors Journal， 2021， 21（4）： 4995-5011. 10.1109/jsen.2020.3033034
12	ZARZAR J， GIANCOLA S， GHANEM B. Efficient tracking proposals using 2D-3D Siamese networks on LIDAR ［EB/OL］. ［2021-02-13］. .
13	LI B， YAN J J， WU W， et al. High performance visual tracking with Siamese region proposal network ［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 8971-8980. 10.1109/cvpr.2018.00935
14	LI B， WU W， WANG Q， et al. SiamRPN++： evolution of Siamese visual tracking with very deep networks ［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 4277-4286. 10.1109/cvpr.2019.00441
15	ZHOU Y， TUZEL O. VoxelNet： end-to-end learning for point cloud based 3D object detection ［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 4490-4499. 10.1109/cvpr.2018.00472
16	YAN Y， MAO Y X， LI B. SECOND： sparsely embedded convolutional detection ［J］. Sensors， 2018， 18（10）： Article No.3337. 10.3390/s18103337
17	LANG A H， VORA S， CAESAR H， et al. PointPillars： fast encoders for object detection from point clouds ［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 12689-12697. 10.1109/cvpr.2019.01298
18	NAM H， HA J W， KIM J. Dual attention networks for multimodal reasoning and matching ［C］// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2017： 2156-2164. 10.1109/cvpr.2017.232
19	FU J， LIU J， TIAN H J， et al. Dual attention network for scene segmentation ［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 3141-3149. 10.1109/cvpr.2019.00326
20	DAI J F， QI H Z， XIONG Y W， et al. Deformable convolutional networks ［C］// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway： IEEE， 2017： 764-773. 10.1109/iccv.2017.89
21	ZHU X Z， HU H， LIN S， et al. Deformable ConvNets v2： more deformable， better results ［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 9300-9308. 10.1109/cvpr.2019.00953
22	YU Y C， XIONG Y L， HUANG W Let al. Deformable Siamese attention networks for visual object tracking ［C］// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2020： 6727-6736. 10.1109/cvpr42600.2020.00676
23	尚丽，苏品刚，周燕.基于改进的快速稀疏编码的图像特征提取［J］.计算机应用，2013，33（3）：656-659. 10.3724/SP.J.1087.2013.00656
	SHANG L， SU P G， ZHOU Y. Image feature extraction based on modified fast sparse coding algorithm ［J］. Journal of Computer Applications， 2013， 33（3）： 656-659. 10.3724/SP.J.1087.2013.00656
24	LIN T Y， GOYAL P， GIRSHICK Ret al. Focal loss for dense object detection ［C］// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway： IEEE， 2017： 2999-3007. 10.1109/iccv.2017.324
25	SHAH J， QURESHI I， DENG Y M， et al. Reconstruction of sparse signals and compressively sampled images based on smooth l₁-norm approximation ［J］. Journal of Signal Processing Systems， 2017， 88（3）： 333-344. 10.1007/s11265-016-1168-8
26	GEIGER A， LENZ P， URTASUN R. Are we ready for autonomous driving？ the KITTI vision benchmark suite ［C］// Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2012： 3354-3361. 10.1109/cvpr.2012.6248074
27	WU Y， LIM J， YANG M H. Online object tracking： a benchmark ［C］// Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2013： 2411-2418. 10.1109/cvpr.2013.312
28	KINGMA D P， BA J L. Adam： a method for stochastic optimization ［EB/OL］. ［2021-02-03］. .

[1]	Jing QIN, Zhiguang QIN, Fali LI, Yueheng PENG. Diagnosis of major depressive disorder based on probabilistic sparse self-attention neural network [J]. Journal of Computer Applications, 2024, 44(9): 2970-2974.
[2]	Liting LI, Bei HUA, Ruozhou HE, Kuang XU. Multivariate time series prediction model based on decoupled attention mechanism [J]. Journal of Computer Applications, 2024, 44(9): 2732-2738.
[3]	Zhiqiang ZHAO, Peihong MA, Xinhong HEI. Crowd counting method based on dual attention mechanism [J]. Journal of Computer Applications, 2024, 44(9): 2886-2892.
[4]	Kaipeng XUE, Tao XU, Chunjie LIAO. Multimodal sentiment analysis network with self-supervision and multi-layer cross attention [J]. Journal of Computer Applications, 2024, 44(8): 2387-2392.
[5]	Pengqi GAO, Heming HUANG, Yonghong FAN. Fusion of coordinate and multi-head attention mechanisms for interactive speech emotion recognition [J]. Journal of Computer Applications, 2024, 44(8): 2400-2406.
[6]	Zhonghua LI, Yunqi BAI, Xuejin WANG, Leilei HUANG, Chujun LIN, Shiyu LIAO. Low illumination face detection based on image enhancement [J]. Journal of Computer Applications, 2024, 44(8): 2588-2594.
[7]	Shangbin MO, Wenjun WANG, Ling DONG, Shengxiang GAO, Zhengtao YU. Single-channel speech enhancement based on multi-channel information aggregation and collaborative decoding [J]. Journal of Computer Applications, 2024, 44(8): 2611-2617.
[8]	Li LIU, Haijin HOU, Anhong WANG, Tao ZHANG. Generative data hiding algorithm based on multi-scale attention [J]. Journal of Computer Applications, 2024, 44(7): 2102-2109.
[9]	Song XU, Wenbo ZHANG, Yifan WANG. Lightweight video salient object detection network based on spatiotemporal information [J]. Journal of Computer Applications, 2024, 44(7): 2192-2199.
[10]	Dahai LI, Zhonghua WANG, Zhendong WANG. Dual-branch low-light image enhancement network combining spatial and frequency domain information [J]. Journal of Computer Applications, 2024, 44(7): 2175-2182.
[11]	Wenliang WEI, Yangping WANG, Biao YUE, Anzheng WANG, Zhe ZHANG. Deep learning model for infrared and visible image fusion based on illumination weight allocation and attention [J]. Journal of Computer Applications, 2024, 44(7): 2183-2191.
[12]	Wu XIONG, Congjun CAO, Xuefang SONG, Yunlong SHAO, Xusheng WANG. Handwriting identification method based on multi-scale mixed domain attention mechanism [J]. Journal of Computer Applications, 2024, 44(7): 2225-2232.
[13]	Huanhuan LI, Tianqiang HUANG, Xuemei DING, Haifeng LUO, Liqing HUANG. Public traffic demand prediction based on multi-scale spatial-temporal graph convolutional network [J]. Journal of Computer Applications, 2024, 44(7): 2065-2072.
[14]	Dianhui MAO, Xuebo LI, Junling LIU, Denghui ZHANG, Wenjing YAN. Chinese entity and relation extraction model based on parallel heterogeneous graph and sequential attention mechanism [J]. Journal of Computer Applications, 2024, 44(7): 2018-2025.
[15]	Zexin XU, Lei YANG, Kangshun LI. Shorter long-sequence time series forecasting model [J]. Journal of Computer Applications, 2024, 44(6): 1824-1831.