Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (2): 564-571. DOI: 10.11772/j.issn.1001-9081.2025030277
• Multimedia computing and computer simulation •
Jun WU, Chuan ZHAO
Received: 2025-03-21
Revised: 2025-05-08
Accepted: 2025-05-09
Online: 2025-05-16
Published: 2026-02-10
Contact: Chuan ZHAO
About author: WU Jun, born in 2000 in Ziyang, Sichuan, M. S. candidate. His research interests include computer vision.
Jun WU, Chuan ZHAO. Small object detection method based on improved DETR algorithm[J]. Journal of Computer Applications, 2026, 46(2): 564-571.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2025030277
| Stage | Module | Stacked blocks | Output size |
|---|---|---|---|
| 1 | EmaFormer | 3 | 32×160×160 |
| 2 | EmaFormer | 3 | 64×80×80 |
| 3 | EmaFormer | 9 | 128×40×40 |
| 4 | EmaFormer | 3 | 256×20×20 |
Tab. 1 Parameters of improved MetaFormer backbone network
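The stage layout in Tab. 1 can be summarized in a short configuration sketch. This is a minimal illustration, not code from the paper: the block counts and channel widths come from the table, while the function name, the 640×640 input size, and the stride-4-stem-then-2×-downsampling assumption are ours (a common MetaFormer-style layout consistent with the table's output sizes).

```python
# Stage layout of the improved MetaFormer backbone (Tab. 1).
# Each entry: (number of stacked EmaFormer blocks, output channels).
STAGES = [(3, 32), (3, 64), (9, 128), (3, 256)]

def stage_output_sizes(input_size=640, stem_stride=4, num_stages=4):
    """Per-stage feature-map side length, assuming a stride-4 stem
    followed by 2x downsampling between stages (an assumption here,
    chosen to match the table's output sizes)."""
    sizes = []
    side = input_size // stem_stride
    for _ in range(num_stages):
        sizes.append(side)
        side //= 2
    return sizes

print(stage_output_sizes(640))  # [160, 80, 40, 20], matching Tab. 1
```

With a 640×640 input, the four stages produce 160, 80, 40, and 20 pixel sides, which agrees with the 32×160×160 through 256×20×20 output sizes listed in Tab. 1.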
| Parameter | Setting | Parameter | Setting |
|---|---|---|---|
| Epochs | 75 | Optimizer | AdamW |
| Batch_size | 4 | — | 3.0 |
| Learning rate | 0.000 1 | — | 1.9 |
| Weight decay | 0.000 1 | | |
Tab. 2 Key parameters for training improved DETR model
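The training hyperparameters in Tab. 2 map directly onto a standard optimizer setup. The sketch below is ours, not the authors' code; it records only the values the table preserves (the two settings 3.0 and 1.9 are omitted because their parameter names were lost in extraction), and the `make_optimizer` helper assumes PyTorch's `torch.optim.AdamW` is available.

```python
# Key training settings for the improved DETR model (Tab. 2).
TRAIN_CONFIG = {
    "epochs": 75,
    "batch_size": 4,
    "learning_rate": 1e-4,   # "0.000 1" in the table
    "weight_decay": 1e-4,
    "optimizer": "AdamW",
}

def make_optimizer(params, cfg=TRAIN_CONFIG):
    """Build the optimizer named in the config (a sketch assuming
    PyTorch; torch.optim.AdamW takes lr and weight_decay keywords)."""
    import torch
    return torch.optim.AdamW(
        params, lr=cfg["learning_rate"], weight_decay=cfg["weight_decay"]
    )
```

AdamW with a 1e-4 learning rate and 1e-4 weight decay is the conventional DETR-family recipe, so the table's settings are unsurprising; the small batch size of 4 reflects the memory cost of Transformer decoders at 640×640 resolution.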
| No. | ResNet-50 backbone | MetaFormer backbone | Improved MetaFormer backbone | Attention decoder | Deformable attention decoder | Original DETR loss | Optimized loss | APS/% | AP50/% | AP50:95/% |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | √ | | | √ | | √ | | 22.5 | 63.1 | 43.3 |
| 2 | | √ | | √ | | √ | | 25.9 | 61.3 | 44.5 |
| 3 | | | √ | √ | | √ | | 27.1 | 63.1 | 46.1 |
| 4 | | | √ | | √ | √ | | 29.3 | 64.8 | 47.2 |
| 5 | | | √ | | √ | | √ | 30.1 | 65.9 | 48.0 |
Tab. 3 Ablation experimental results
| Model | Parameters/10⁶ | AP50:95/% | AP50/% | AP75/% | APS/% | APM/% | APL/% |
|---|---|---|---|---|---|---|---|
| DETR[1] | 41 | 43.3 | 63.1 | 45.9 | 22.5 | 47.3 | 61.1 |
| Conditional-DETR | 44 | 45.0 | 65.4 | 48.5 | 25.3 | 49.9 | 62.2 |
| Anchor-DETR | 37 | 44.2 | 64.7 | 47.5 | 24.7 | 48.2 | 60.6 |
| Deformable-DETR[4] | 40 | 46.2 | 65.2 | 50.0 | 28.8 | 49.2 | 61.7 |
| DN-DETR | 48 | 46.3 | 66.4 | 49.7 | 26.7 | 50.0 | 64.3 |
| DAB-DETR | 48 | 44.5 | 65.1 | 47.7 | 25.3 | 48.2 | 62.3 |
| Efficient DETR | 35 | 45.1 | 63.1 | 49.1 | 28.3 | 48.4 | 59.0 |
| SMCA-DETR | 40 | 45.6 | 65.5 | 49.1 | 25.9 | 49.3 | 62.6 |
| TSP-FCOS | — | 43.1 | 62.3 | 47.0 | 26.6 | 46.8 | 55.9 |
| TSP-RCNN | — | 43.8 | 63.3 | 48.3 | 28.6 | 46.9 | 55.7 |
| Sparse DETR | 41 | 46.3 | 66.0 | 50.1 | 29.0 | 49.5 | 60.8 |
| SAM-DETR | 58 | 45.0 | 65.4 | 47.9 | 26.2 | 49.0 | 63.3 |
| Proposed model | 45 | 48.0 | 65.9 | 51.8 | 30.1 | 52.9 | 66.0 |
Tab. 4 Results of horizontal comparison experiments
| [1] | CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with Transformers[C]// Proceedings of the 2020 European Conference on Computer Vision, LNCS 12346. Cham: Springer, 2020: 213-229. |
| [2] | YU W, SI C, ZHOU P, et al. MetaFormer baselines for vision[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(2): 896-912. |
| [3] | OUYANG D, HE S, ZHANG G, et al. Efficient multi-scale attention module with cross-spatial learning[C]// Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2023: 1-5. |
| [4] | ZHU X, SU W, LU L, et al. Deformable DETR: deformable Transformers for end-to-end object detection[EB/OL]. [2024-10-13]. |
| [5] | TONG Z, CHEN Y, XU Z, et al. Wise-IoU: bounding box regression loss with dynamic focusing mechanism[EB/OL]. [2024-10-02]. |
| [6] | GIRSHICK R, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]// Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2014: 580-587. |
| [7] | REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. |
| [8] | REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 779-788. |
| [9] | REDMON J, FARHADI A. YOLO9000: better, faster, stronger[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6517-6525. |
| [10] | REDMON J, FARHADI A. YOLOv3: an incremental improvement[EB/OL]. [2024-12-25]. |
| [11] | BOCHKOVSKIY A, WANG C Y, MARK LIAO H Y. YOLOv4: optimal speed and accuracy of object detection[EB/OL]. [2020-04-23]. |
| [12] | LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot MultiBox detector[C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9905. Cham: Springer, 2016: 21-37. |
| [13] | DENG C, WANG M, LIU L, et al. Extended feature pyramid network for small object detection[J]. IEEE Transactions on Multimedia, 2022, 24: 1968-1979. |
| [14] | LIM J S, ASTRID M, YOON H J, et al. Small object detection using context and attention[C]// Proceedings of the 2021 International Conference on Artificial Intelligence in Information and Communication. Piscataway: IEEE, 2021: 181-186. |
| [15] | CUI L, LV P, JIANG X, et al. Context-aware block net for small object detection[J]. IEEE Transactions on Cybernetics, 2022, 52(4): 2300-2313. |
| [16] | WANG G, CHEN Y, AN P, et al. UAV-YOLOv8: a small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios[J]. Sensors, 2023, 23(16): No.7190. |
| [17] | LENG J, REN Y, JIANG W, et al. Realize your surroundings: exploiting context information for small object detection[J]. Neurocomputing, 2021, 433: 287-299. |
| [18] | LIU H, SUN F, GU J, et al. SF-YOLOv5: a lightweight small object detection algorithm based on improved feature fusion mode[J]. Sensors, 2022, 22(15): No.5817. |
| [19] | MIN K, LEE G H, LEE S W. Attentional feature pyramid network for small object detection[J]. Neural Networks, 2022, 155: 439-450. |
| [20] | TANG S, ZHANG S, FANG Y. HIC-YOLOv5: improved YOLOv5 for small object detection[C]// Proceedings of the 2024 IEEE International Conference on Robotics and Automation. Piscataway: IEEE, 2024: 6614-6619. |
| [21] | JING R, ZHANG W, LIU Y, et al. An effective method for small object detection in low-resolution images[J]. Engineering Applications of Artificial Intelligence, 2024, 127(Pt A): No.107206. |
| [22] | TONG K, WU Y. Small object detection using deep feature learning and feature fusion network[J]. Engineering Applications of Artificial Intelligence, 2024, 132: No.107931. |
| [23] | LI L, LI B, ZHOU H. Lightweight multi-scale network for small object detection[J]. PeerJ Computer Science, 2022, 8: No.e1145. |
| [24] | YAN B, LI J, YANG Z, et al. AIE-YOLO: auxiliary information enhanced YOLO for small object detection[J]. Sensors, 2022, 22(21): No.8221. |
| [25] | HUANG S, LIU Q. Addressing scale imbalance for small object detection with dense detector[J]. Neurocomputing, 2022, 473: 68-78. |
| [26] | WANG M, YANG W, WANG L, et al. FE-YOLOv5: feature enhancement network based on YOLOv5 for small object detection[J]. Journal of Visual Communication and Image Representation, 2023, 90: No.103752. |
| [27] | JI S J, LING Q H, HAN F. An improved algorithm for small object detection based on YOLO v4 and multi-scale contextual information[J]. Computers and Electrical Engineering, 2023, 105: No.108490. |
| [28] | HAO C, ZHANG H, SONG W, et al. SliNet: slicing-aided learning for small object detection[J]. IEEE Signal Processing Letters, 2024, 31: 790-794. |
| [29] | ZHANG X, LU T, WANG J, et al. Small object detection by edge-aware neural network[J]. Engineering Applications of Artificial Intelligence, 2024, 138(Pt B): No.109406. |
| [30] | WANG L, ZHOU Z, SHI G, et al. Small object detection based on bidirectional feature fusion and multi-scale distillation[C]// Proceedings of the 2024 International Conference on Artificial Neural Networks, LNCS 15017. Cham: Springer, 2024: 200-214. |
| [31] | LI X, LI X, TAN H, et al. SAMF: small-area-aware multi-focus image fusion for object detection[C]// Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2024: 3845-3849. |
| [32] | LIU S, LI F, ZHANG H, et al. DAB-DETR: dynamic anchor boxes are better queries for DETR[EB/OL]. [2025-04-03]. |
| [33] | MENG D, CHEN X, FAN Z, et al. Conditional DETR for fast training convergence[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 3631-3640. |
| [34] | HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. |
| [35] | WOO S, DEBNATH S, HU R, et al. ConvNeXt V2: co-designing and scaling ConvNets with masked autoencoders[C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 16133-16142. |
| [36] | ZHU L, WANG X, KE Z, et al. BiFormer: Vision Transformer with bi-level routing attention[C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 10323-10333. |
| [37] | REZATOFIGHI H, TSOI N, GWAK J, et al. Generalized intersection over union: a metric and a loss for bounding box regression[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 658-666. |
| [38] | WANG Y, ZHANG X, YANG T, et al. Anchor DETR: query design for transformer-based detector[C]// Proceedings of the 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 2567-2575. |
| [39] | LI F, ZHANG H, LIU S, et al. DN-DETR: accelerate DETR training by introducing query denoising[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 13609-13617. |
| [40] | YAO Z, AI J, LI B, et al. Efficient DETR: improving end-to-end object detector with dense prior[EB/OL]. [2025-04-03]. |
| [41] | GAO P, ZHENG M, WANG X, et al. Fast convergence of DETR with spatially modulated co-attention[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 3601-3610. |
| [42] | SUN Z, CAO S, YANG Y, et al. Rethinking Transformer-based set prediction for object detection[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 3591-3600. |
| [43] | ROH B, SHIN J, SHIN W, et al. Sparse DETR: efficient end-to-end object detection with learnable sparsity[EB/OL]. [2025-01-13]. |
| [44] | ZHANG G, LUO Z, YU Y, et al. Accelerating DETR convergence via semantic-aligned matching[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 939-948. |