Object detection based on Gaussian-YOLO v3 with embedded attention and feature intertwine modules
LIU Dan¹, WU Yajuan¹, LUO Nanchao², ZHENG Bochuan³
1. School of Computer Science, China West Normal University, Nanchong Sichuan 637002, China; 2. School of Computer Science and Technology, Aba Teachers University, Aba Sichuan 623002, China; 3. School of Mathematics and Information, China West Normal University, Nanchong Sichuan 637002, China
Abstract: Incorrect object detection can lead to serious accidents, so high-precision object detection is crucial for autonomous driving. An object detection method based on Gaussian-YOLO v3 combining attention and feature intertwine modules was proposed, in which several specific feature maps were improved. First, an attention module was added to the feature map to learn the weight of each channel autonomously, enhancing key features and suppressing redundant ones, thereby strengthening the network's ability to distinguish foreground objects from background. Second, different channels of the feature map were intertwined to obtain more representative features. Finally, the features produced by the attention and feature intertwine modules were fused to form a new feature map. Experimental results show that the proposed method achieves an mAP (mean Average Precision) of 20.81% and an F1 score of 18.17% on the BDD100K dataset, and decreases the false alarm rate by 3.5 percentage points. The detection performance of the proposed method is therefore better than that of YOLO v3 and Gaussian-YOLO v3.
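Since the paper's implementation is not reproduced here, the following is a minimal PyTorch sketch of the three steps the abstract describes: per-channel attention, channel intertwining, and fusion into a new feature map. It assumes squeeze-and-excitation (SE)-style attention for the channel weights and a ShuffleNet-style channel shuffle for the intertwine operation; the 1×1 fusion convolution and all module names are illustrative assumptions, not the authors' exact design.

```python
# Illustrative sketch only: SE-style attention and channel shuffle are assumed
# stand-ins for the paper's attention and feature intertwine modules.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """SE-style attention: learn one weight per channel and rescale the feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                    # squeeze: global spatial average
        self.fc = nn.Sequential(                               # excitation: bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                      # per-channel weight in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                           # enhance key channels, suppress redundant ones


def intertwine(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Interleave channels across groups (ShuffleNet-style channel shuffle)."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)


class AttentionIntertwineBlock(nn.Module):
    """Fuse the attention branch and the intertwine branch into a new feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.attn = ChannelAttention(channels)
        # 1x1 convolution as the fusion step is an assumption for illustration
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.attn(x)                                       # attention branch
        t = intertwine(x)                                      # intertwine branch
        return self.fuse(torch.cat([a, t], dim=1))             # fused feature map


# Usage on a YOLO v3-scale feature map, e.g. 256 channels at 52x52:
block = AttentionIntertwineBlock(256)
y = block(torch.randn(1, 256, 52, 52))                         # y.shape == (1, 256, 52, 52)
```

In this sketch the block preserves the input's channel count and spatial size, so it could be inserted in front of a detection head without altering the rest of the network; how and where the paper places its modules on the Gaussian-YOLO v3 feature maps is described in the full text.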