Small target detection method in UAV images based on fusion of dilated convolution and Transformer

doi:10.11772/j.issn.1001-9081.2023111575

Journal of Computer Applications, 2024, Vol. 44, Issue (11): 3595-3602. DOI: 10.11772/j.issn.1001-9081.2023111575

• Multimedia computing and computer simulation •

Lin WANG¹,², Jingliang LIU², Wuwei WANG³

1. School of Data and Computer Science, Xiamen Institute of Technology, Xiamen Fujian 361021, China
2. School of Automation and Information Engineering, Xi'an University of Technology, Xi'an Shaanxi 710048, China
3. School of Automation, Xi'an University of Posts and Telecommunications, Xi'an Shaanxi 710121, China

Received: 2023-11-16 Revised: 2023-12-22 Accepted: 2023-12-22 Online: 2024-01-04 Published: 2024-11-10
Contact: Jingliang LIU
About author: WANG Lin, born in 1963 in Dongtai, Jiangsu, Ph.D., professor. His research interests include computer vision.
WANG Wuwei, born in 1991 in Dongtai, Jiangsu, Ph.D., associate professor. His research interests include machine vision, deep learning, battlefield environment perception, and guidance systems based on visual information.
Supported by:
National Natural Science Foundation of China (62202376); Shaanxi Association for Science and Technology Youth Talent Support Program (20220129); Special Scientific Research Program of Shaanxi Provincial Department of Education (22JK0565); Science and Technology Starting Project of Xiamen Institute of Technology (KYYKT202301)

Abstract

To address the complex scenes, diverse target scales, dense small targets and severe occlusion found in Unmanned Aerial Vehicle (UAV) aerial images, Swin-Det, a UAV image target detection algorithm based on multi-scale dilated convolution, was proposed. Firstly, Swin Transformer was used as the backbone feature extraction network, and a Spatial Information Blending Module (SIBM) was introduced into the backbone to address the blurring of target information caused by occlusion between objects. Secondly, a Fusion of Dilation Feature Pyramid Network (FDFPN) was proposed, which fuses feature information through multi-branch dilated convolution, thereby enlarging the receptive field of the network and improving the reuse of feature information, so that the model can learn detailed features of different dimensions. Finally, a linear interpolation method and a multi-task loss function were used to address mismatched prediction regions and sample imbalance, thereby improving the detection precision of the model. Experimental results on the VisDrone dataset show that Swin-Det reaches a mean Average Precision (mAP) of 27.2%, which is 4.1 percentage points higher than that of the original Swin Transformer, and converges faster under the same training batch. These results indicate that Swin-Det can achieve high-precision detection of UAV image targets in complex scenes.

Key words: small target detection, feature fusion, dilated convolution, Unmanned Aerial Vehicle (UAV) image, Swin Transformer
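The multi-branch dilated convolution idea mentioned in the abstract can be illustrated with a small, self-contained sketch. This is not the FDFPN implementation: the 1D setting, the all-ones kernel, the function names `dilated_conv1d` and `fuse_dilated_branches`, and fusion by element-wise summation are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1D dilated convolution (cross-correlation form)."""
    center = (len(kernel) - 1) // 2
    output = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            # Kernel taps are spaced `dilation` samples apart around position i.
            idx = i + (j - center) * dilation
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        output.append(acc)
    return output


def fuse_dilated_branches(signal, kernel, dilations):
    """Run one branch per dilation rate and fuse the branches.

    Element-wise summation is an assumption of this sketch, not
    necessarily the fusion rule used in the paper.
    """
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    return [sum(values) for values in zip(*branches)]


# A unit impulse makes the sampling pattern of each branch visible.
impulse = [0.0] * 11
impulse[5] = 1.0
fused = fuse_dilated_branches(impulse, kernel=[1.0, 1.0, 1.0], dilations=(1, 3, 5))
print(fused)
```

With three weights per branch, the fused response reaches five samples to either side of the impulse, which is the sense in which dilation enlarges the receptive field without adding parameters.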

CLC Number: TP391.4

Lin WANG, Jingliang LIU, Wuwei WANG. Small target detection method in UAV images based on fusion of dilated convolution and Transformer[J]. Journal of Computer Applications, 2024, 44(11): 3595-3602.

Figures/Tables 14

References 25

1	LIANG J, CHEN X, LIANG C, et al. A detection approach for late-autumn shoots of litchi based on Unmanned Aerial Vehicle (UAV) remote sensing[J]. Computers and Electronics in Agriculture, 2023, 204: No.107535.
2	SILVA L A, LEITHARDT V R Q, BATISTA V F L, et al. Automated road damage detection using UAV images and deep learning techniques[J]. IEEE Access, 2023, 11: 62918-62931.
3	XUE Y, JIN G, SHEN T, et al. Template-guided frequency attention and adaptive cross-entropy loss for UAV visual tracking[J]. Chinese Journal of Aeronautics, 2023, 36(9): 299-312.
4	LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 936-944.
5	LIU Y J, YANG F B, HU P. Parallel FPN algorithm based on Cascade R-CNN for object detection from UAV aerial images[J]. Laser and Optoelectronics Progress, 2020, 57(20): No.201505. (in Chinese)
6	HONG S, KANG S, CHO D. Patch-level augmentation for object detection in aerial images[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops. Piscataway: IEEE, 2019: 127-134.
7	LIANG B, SU J, FENG K, et al. Cross-layer triple-branch parallel fusion network for small object detection in UAV images[J]. IEEE Access, 2023, 11: 39738-39750.
8	QIAO S, CHEN L C, YUILLE A. DetectoRS: detecting objects with recursive feature pyramid and switchable atrous convolution[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 10208-10219.
9	BEERY S, WU G, RATHOD V, et al. Context R-CNN: long term temporal context for per-camera object detection[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 13072-13082.
10	TIAN T T, YANG J. Object detection for remote sensing image based on multiscale feature fusion network[J]. Laser and Optoelectronics Progress, 2022, 59(16): No.1628003. (in Chinese)
11	YANG X, YANG J, YAN J, et al. SCRDet: towards more robust detection for small, cluttered and rotated objects[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 8231-8240.
12	TIAN Y L, WANG Y T, WANG J G, et al. Key problems and progress of vision Transformers: the state of the art and prospects[J]. Acta Automatica Sinica, 2022, 48(4): 957-979. (in Chinese)
13	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[EB/OL]. (2021-06-03) [2022-10-22].
14	LIU Z, LIN Y, CAO Y, et al. Swin Transformer: hierarchical vision Transformer using shifted windows[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 9992-10002.
15	XU Y, YANG Y, ZHANG L. DeMT: deformable mixer Transformer for multi-task learning of dense prediction[C]// Proceedings of the 37th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2023: 3072-3080.
16	HE X, ZHOU Y, ZHAO J, et al. Swin Transformer embedding UNet for remote sensing image semantic segmentation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: No.4408715.
17	JIANG X, WU Y. Remote sensing object detection based on convolution and Swin Transformer[J]. IEEE Access, 2023, 11: 38643-38656.
18	FU M M, DENG M L, ZHANG D X. Object detection algorithms based on deep learning and Transformer[J]. Computer Engineering and Applications, 2023, 59(1): 37-48. (in Chinese)
19	WANG P, CHEN P, YUAN Y, et al. Understanding convolution for semantic segmentation[C]// Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision. Piscataway: IEEE, 2018: 1451-1460.
20	LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot MultiBox detector[C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9905. Cham: Springer, 2016: 21-37.
21	RUKHOVICH D, SOFIIUK K, GALEEV D, et al. IterDet: iterative scheme for object detection in crowded environments[C]// Proceedings of the 2020 Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), LNCS 12644. Cham: Springer, 2021: 344-354.
22	XU C, WANG J, YANG W, et al. RFLA: Gaussian receptive field based label assignment for tiny object detection[C]// Proceedings of the 2022 European Conference on Computer Vision, LNCS 13669. Cham: Springer, 2022: 526-543.
23	XU J, XIE Z G, LI H J. Feature-balanced UAV aerial image target detection algorithm[J]. Computer Engineering and Applications, 2023, 59(6): 196-203. (in Chinese)
24	ZHU X, SU W, LU L, et al. Deformable DETR: deformable Transformers for end-to-end object detection[EB/OL]. [2022-11-14].
25	ALBABA B M, OZER S. SyNet: an ensemble network for object detection in UAV images[C]// Proceedings of the 25th International Conference on Pattern Recognition. Piscataway: IEEE, 2021: 10227-10234.

Experiment No.	SIBM dilation rate			FDFPN dilation rates			mAP/%
Experiment No.	1	2	3	{1,2,3}	{1,3,5}	{1,5,7}	mAP/%
1	√	×	×	√	×	×	26.7
2	×	√	×	×	√	×	27.0
3	×	×	√	×	×	√	26.3
4	×	√	×	√	×	×	26.8
5	×	√	×	×	×	√	26.5
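To see what the dilation-rate sets in the table above mean spatially, the standard formula for the effective extent of a dilated kernel, k_eff = k + (k - 1)(d - 1), can be evaluated for each set. The sketch below assumes 3×3 kernels, which the source does not state; `KERNEL_SIZE` and `effective_kernel_size` are hypothetical names.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Dilation-rate sets compared for FDFPN in the table.
RATE_SETS = [(1, 2, 3), (1, 3, 5), (1, 5, 7)]
KERNEL_SIZE = 3  # assumption: the kernel size is not given in the source

extents = {
    rates: [effective_kernel_size(KERNEL_SIZE, d) for d in rates]
    for rates in RATE_SETS
}

for rates, sizes in extents.items():
    print(f"rates {rates} -> effective kernel sizes {sizes}")
```

Under this assumption the best-performing set {1,3,5} spans extents of 3, 7 and 11 pixels per branch, while {1,5,7} reaches 15 pixels but leaves wider gaps between sampled positions.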

Algorithm	SIBM	FDFPN	Linear interpolation	mAP/%	AP_S/%	AP_M/%	AP_L/%	Params/MB
Swin-T	×	×	×	23.1	15.8	33.7	36.2	38.6
A	√	×	×	24.3	16.9	34.4	36.7	39.5
B	×	√	×	24.6	17.1	35.6	38.2	42.1
C	√	√	×	27.0	19.4	37.0	41.3	47.4
D	√	√	√	27.2	19.5	37.4	41.4	47.4
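The 4.1 percentage point gain quoted in the abstract can be recomputed directly from the ablation table. The short sketch below only does arithmetic on the tabulated mAP values; the names `MAP_BY_VARIANT` and `gains` are illustrative.

```python
# mAP/% values copied from the ablation table (Swin-T is the baseline).
MAP_BY_VARIANT = {
    "Swin-T": 23.1,
    "A": 24.3,  # + SIBM
    "B": 24.6,  # + FDFPN
    "C": 27.0,  # + SIBM + FDFPN
    "D": 27.2,  # + SIBM + FDFPN + linear interpolation
}

baseline = MAP_BY_VARIANT["Swin-T"]
gains = {
    name: round(value - baseline, 1)
    for name, value in MAP_BY_VARIANT.items()
    if name != "Swin-T"
}

for name, gain in gains.items():
    print(f"{name}: +{gain} percentage points over Swin-T")

# Sum of the two single-module gains, for comparison with variant C.
separate_sum = round(gains["A"] + gains["B"], 1)
print(f"A + B separately: +{separate_sum}, combined (C): +{gains['C']}")
```

On these numbers, SIBM and FDFPN together add more (3.9 points) than the sum of their separate gains (2.7 points), which suggests the two modules are complementary.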

Algorithm	AP/%										mAP/%	AP₅₀/%	AP₇₅/%	Frame rate/(frame·s⁻¹)
Algorithm	Pedestrian	People	Bicycle	Car	Van	Truck	Tricycle	Awning-tricycle	Bus	Motor	mAP/%	AP₅₀/%	AP₇₅/%	Frame rate/(frame·s⁻¹)
SSD[20]	13.0	7.9	3.7	45.3	19.7	11.4	9.2	4.2	27.7	12.8	15.5	27.3	15.1	37.0
FPN[4]	14.8	9.4	5.5	42.4	23.6	16.3	12.2	7.0	32.6	13.4	17.7	33.4	15.9	24.0
IterDet[21]	16.5	12.1	6.8	48.7	28.4	19.0	11.4	7.2	35.4	18.7	20.4	36.8	20.3	11.2
Faster R-CNN[22]	20.9	14.8	7.3	51.0	30.2	19.8	14.0	8.1	35.5	21.1	22.3	39.0	21.7	25.4
YOLOv5s[23]	16.2	8.2	7.2	50.6	31.4	27.9	14.4	14.1	41.4	15.7	22.7	40.8	22.5	90.0
Swin-T	23.0	14.0	9.3	49.6	30.2	22.8	16.0	6.7	37.4	22.3	23.1	42.5	22.0	21.3
DDETR[24]	25.5	14.1	10.6	53.0	36.9	25.2	15.3	6.7	38.3	22.4	24.8	42.7	25.1	19.7
SyNet[25]	26.2	15.3	11.1	50.2	33.0	23.9	16.4	8.6	39.1	26.9	25.1	48.4	26.2	16.0
Swin-Det	28.8	18.6	12.4	54.7	35.4	25.1	19.2	9.1	42.2	26.8	27.2	50.7	28.3	18.4
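One way to read the comparison table is as an accuracy/speed trade-off. The sketch below takes the mAP and frame-rate columns and lists the detectors that no other detector beats on both at once (a Pareto front). It is only an illustrative reading of the tabulated numbers: it assumes the frame rates are comparable across methods, which the table alone does not establish, and `pareto_front` is a hypothetical helper.

```python
# (mAP/%, frame rate in frame/s) pairs copied from the comparison table.
RESULTS = {
    "SSD": (15.5, 37.0),
    "FPN": (17.7, 24.0),
    "IterDet": (20.4, 11.2),
    "Faster R-CNN": (22.3, 25.4),
    "YOLOv5s": (22.7, 90.0),
    "Swin-T": (23.1, 21.3),
    "DDETR": (24.8, 19.7),
    "SyNet": (25.1, 16.0),
    "Swin-Det": (27.2, 18.4),
}


def pareto_front(results):
    """Return detectors not dominated in both mAP and frame rate."""
    front = []
    for name, (m_ap, fps) in results.items():
        dominated = any(
            other_map >= m_ap
            and other_fps >= fps
            and (other_map > m_ap or other_fps > fps)
            for other, (other_map, other_fps) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(RESULTS)
print(front)
```

Swin-Det sits on this front as the most accurate entry, while YOLOv5s anchors the high-speed end.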

Related Articles 15

[1]	Yexin PAN, Zhe YANG. Optimization model for small object detection based on multi-level feature bidirectional fusion [J]. Journal of Computer Applications, 2024, 44(9): 2871-2877.
[2]	Zhiqiang ZHAO, Peihong MA, Xinhong HEI. Crowd counting method based on dual attention mechanism [J]. Journal of Computer Applications, 2024, 44(9): 2886-2892.
[3]	Ruihua LIU, Zihe HAO, Yangyang ZOU. Gait recognition algorithm based on multi-layer refined feature fusion [J]. Journal of Computer Applications, 2024, 44(7): 2250-2257.
[4]	Mengyuan HUANG, Kan CHANG, Mingyang LING, Xinjie WEI, Tuanfa QIN. Progressive enhancement algorithm for low-light images based on layer guidance [J]. Journal of Computer Applications, 2024, 44(6): 1911-1919.
[5]	Yue LIU, Fang LIU, Aoyun WU, Qiuyue CHAI, Tianxiao WANG. 3D object detection network based on self-attention mechanism and graph convolution [J]. Journal of Computer Applications, 2024, 44(6): 1972-1977.
[6]	Hongtian LI, Xinhao SHI, Weiguo PAN, Cheng XU, Bingxin XU, Jiazheng YUAN. Few-shot object detection via fusing multi-scale and attention mechanism [J]. Journal of Computer Applications, 2024, 44(5): 1437-1444.
[7]	Jun FENG, Jiankang BI, Yiru HUO, Jiakuan LI. PIPNet: lightweight asphalt pavement crack image segmentation network [J]. Journal of Computer Applications, 2024, 44(5): 1520-1526.
[8]	Guijin HAN, Xinyuan ZHANG, Wentao ZHANG, Ya HUANG. Self-supervised image registration algorithm based on multi-feature fusion [J]. Journal of Computer Applications, 2024, 44(5): 1597-1604.
[9]	Xin LI, Qiao MENG, Junyi HUANGFU, Lingchen MENG. YOLOv5 multi-attribute classification based on separable label collaborative learning [J]. Journal of Computer Applications, 2024, 44(5): 1619-1628.
[10]	Xinyuan YOU, Heng WANG. Monaural speech enhancement based on gated dilated convolutional recurrent network [J]. Journal of Computer Applications, 2024, 44(4): 1317-1324.
[11]	Yuliang ZHENG, Yunhua CHEN, Weijie BAI, Pinghua CHEN. Vehicle target detection by fusing event data and image frames [J]. Journal of Computer Applications, 2024, 44(3): 931-937.
[12]	Zhanjun JIANG, Baijing WU, Long MA, Jing LIAN. Faster-RCNN water-floating garbage recognition based on multi-scale feature and polarized self-attention [J]. Journal of Computer Applications, 2024, 44(3): 938-944.
[13]	Ning WU, Yangyang LUO, Huajie XU. Semantic segmentation method for remote sensing images based on multi-scale feature fusion [J]. Journal of Computer Applications, 2024, 44(3): 737-744.
[14]	Xinye LI, Yening HOU, Yinghui KONG, Zhiqi YAN. Few-shot object detection combining feature fusion and enhanced attention [J]. Journal of Computer Applications, 2024, 44(3): 745-751.
[15]	Zongze JIA, Pengfei GAO, Yinglong MA, Xiaofeng LIU, Haixin XIA. Multi-feature fusion attention-based hierarchical classification method for dialogue act [J]. Journal of Computer Applications, 2024, 44(3): 715-721.