Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (9): 2712-2719.DOI: 10.11772/j.issn.1001-9081.2020111852

Special Issue: Multimedia Computing and Computer Simulation

• Multimedia computing and computer simulation •

General object detection framework based on improved Faster R-CNN

MA Jialiang1,2, CHEN Bin2,3, SUN Xiaofei1,2   

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu Sichuan 610041, China;
    2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China;
    3. Institute for Artificial Intelligence, Harbin Institute of Technology(Shenzhen), Shenzhen Guangdong 518055, China
  • Received: 2020-11-25  Revised: 2021-01-13  Online: 2021-09-10  Published: 2021-05-12

  • Corresponding author: CHEN Bin
  • About the authors: MA Jialiang (born 1996), male, from Shijiazhuang, Hebei, M.S. candidate; research interests: object detection, semantic segmentation. CHEN Bin (born 1970), male, from Guanghan, Sichuan, research fellow, Ph.D.; research interests: machine vision, deep learning. SUN Xiaofei (born 1981), male, from Qixia, Shandong, Ph.D. candidate; research interests: machine vision, pattern recognition.

Abstract: Aiming at the problem that current deep-learning-based detectors cannot effectively detect objects with irregular shapes or extreme aspect ratios, an improved two-stage object detection framework named Accurate R-CNN was proposed on the basis of the traditional Faster Region-based Convolutional Neural Network (Faster R-CNN). First, a new Intersection over Union (IoU) metric, Effective Intersection over Union (EIoU), was proposed, which uses a centrality weight to reduce the proportion of redundant bounding boxes in the training data. Then, a context-aware Feature Reassignment Module (FRM) was proposed to re-encode features by modeling the long-range dependencies and local context of objects, compensating for the shape information lost during pooling. Experimental results show that on the Microsoft Common Objects in COntext (MS COCO) dataset, for the bounding box detection task, when Residual Networks (ResNets) of depth 50 and 101 are used as backbone networks, Accurate R-CNN improves the Average Precision (AP) by 1.7 and 1.1 percentage points respectively over the baseline Faster R-CNN, which is significantly higher than the improvements of mask-based detectors with the same backbone networks. After a mask branch is added, for the instance segmentation task, with the same two backbone networks, Accurate R-CNN improves the mask AP by 1.2 and 1.1 percentage points respectively over Mask Region-based Convolutional Neural Network (Mask R-CNN). These results show that, compared with the baseline model, Accurate R-CNN achieves better performance on different datasets and different tasks.
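The abstract describes EIoU as standard IoU reweighted by a centrality term, so that off-center, redundant proposals contribute less during training. The paper's exact formula is not given on this page, so the sketch below pairs the standard IoU computation with a hypothetical centrality weight (center-to-center distance normalized by the ground-truth diagonal); `centrality_weight` is an illustrative assumption, not the authors' definition.

```python
import math

def iou(box_a, box_b):
    """Standard Intersection over Union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def centrality_weight(box, gt_box):
    """Hypothetical centrality weight: decays linearly with the distance
    between box centers, normalized by the ground-truth diagonal.
    Illustration only; the paper's EIoU definition may differ."""
    dx = (box[0] + box[2]) / 2 - (gt_box[0] + gt_box[2]) / 2
    dy = (box[1] + box[3]) / 2 - (gt_box[1] + gt_box[3]) / 2
    diag = math.hypot(gt_box[2] - gt_box[0], gt_box[3] - gt_box[1])
    return max(0.0, 1.0 - math.hypot(dx, dy) / diag)

def eiou(box, gt_box):
    """IoU down-weighted for proposals whose center drifts from the GT center."""
    return centrality_weight(box, gt_box) * iou(box, gt_box)
```

With such a weight, a proposal overlapping the ground truth but centered far from it scores lower than a well-centered one of equal IoU, which matches the stated goal of pruning redundant boxes from the training set.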

Key words: computer vision, object detection, instance segmentation, Intersection over Union (IoU), Region of Interest Pooling (RoI Pooling)