Object detection algorithm with few-shot learning based on YOLO-World

doi:10.11772/j.issn.1001-9081.2025050589

Abstract

Abstract:

Object detection has been widely applied in the field of computer vision. However， most existing methods rely on large-scale labeled data heavily， which make it difficult to address the problem of limited samples in new categories under real-world conditions. Although current Open-Vocabulary object Detection （OVD） methods have certain cross-category generalization ability， issues such as rough semantic matching and inadequate spatial localization accuracy commonly occur when facing new categories with similar structure. To overcome these issues， an object detection algorithm with few-shot learning based on YOLO-World was proposed. Firstly， Category-aware Convolution Kernel Construction Module （CCKCM） was proposed to fuse textual semantic embeddings with visual features， thereby enhancing semantic perception ability for new categories under few-shot setting. Secondly， an efficient object matching and localization mechanism was introduced by combining sliding convolution with spatial geometric constraints， thereby realizing fast matching and accurate localization of target regions while maintaining low computational complexity. Finally， an image dataset for Few-Shot Object Detection （FSOD） tasks was built， covering multiple classic scenes and object categories. Experimental results show that on the PASCAL VOC 2007+2012 dataset， the 10-shot average precision for novel classes of the proposed algorithm reaches 73.4%， which is 1.4 percentage points higher than that of FM-FSOD. It can be seen that the proposed algorithm provides a feasible technical path for the rapid recognition of objects in new categories in real-world scenarios.

Key words: YOLO-World, Open-Vocabulary object Detection (OVD), few-shot learning, feature fusion, sliding convolution

摘要：

目标检测技术在计算机视觉领域得到了广泛应用，但现有方法大多依赖大量标注数据，难以解决现实中面临的新类别样本稀缺问题。尽管现有开放词汇目标检测（OVD）方法具备一定的跨类泛化能力，但在面向结构相近的新类别时，普遍存在语义匹配粗略、空间定位精度不足的问题。针对上述问题，提出一种基于YOLO-World的少样本学习目标检测算法。首先，提出类别感知卷积核构建模块（CCKCM），将文本语义嵌入与图像特征融合，提升模型在少样本条件下对新类别的语义感知能力；其次，设计一种融合滑动卷积与几何空间约束的高效目标匹配与定位机制，在保持较低计算复杂度的同时，实现对目标区域的快速匹配与精准定位；最后，构建一个面向少样本目标检测（FSOD）任务的图像数据集，涵盖多个典型场景与目标类别。实验结果表明，所提算法在PASCAL VOC 2007+2012数据集上的10-shot下新类的平均精度达到了73.4%，比FM-FSOD提高了1.4个百分点。可见，所提算法为实际场景中新类别目标的快速识别提供了一条可行的技术路径。

关键词: YOLO-World, 开放词汇目标检测, 少样本学习, 特征融合, 滑动卷积

CLC Number:

TP391.41

Shuai HE, Chunhua DENG. Object detection algorithm with few-shot learning based on YOLO-World[J]. Journal of Computer Applications, 2026, 46(4): 1275-1282.

何帅, 邓春华. 基于YOLO-World的少样本学习目标检测算法[J]. 《计算机应用》唯一官方网站, 2026, 46(4): 1275-1282.

Figures/Tables 12

References 30

[1]	KHANAM R， HUSSAIN M. YOLOv11： an overview of the key architectural enhancements［EB/OL］. ［2024-04-11］..
[2]	ZHU Z， LIN K， JAIN A K， et al. Transfer learning in deep reinforcement learning： a survey［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2023， 45（11）： 13344-13362.
[3]	NAJDENKOSKA I， ZHEN X， WORRING M. Meta learning to bridge vision and language models for multimodal few-shot learning［EB/OL］. ［2024-04-11］..
[4]	CHENG T， SONG L， GE Y， et al. YOLO-World： real-time open-vocabulary object detection［C］// Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2024： 16901-16911.
[5]	ZHANG Y， YANG Q. A survey on multi-task learning［J］. IEEE Transactions on Knowledge and Data Engineering， 2022， 34（12）： 5586-5609.
[6]	ZHANG J， HUANG J， JIN S， et al. Vision-language models for vision tasks： a survey［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2024， 46（8）： 5625-5644.
[7]	REN S， HE K， GIRSHICK R， et al. Faster R-CNN： towards real-time object detection with region proposal networks［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2017， 39（6）： 1137-1149.
[8]	REDMON J， DIVVALA S， GIRSHICK R， et al. You only look once： unified， real-time object detection［C］// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2016： 779-788.
[9]	CARION N， MASSA F， SYNNAEVE G， et al. End-to-end object detection with Transformers［C］// Proceedings of the 2020 European Conference on Computer Vision， LNCS 12346. Cham： Springer， 2020： 213-229.
[10]	VASWANI A， SHAZEER N， PARMAR N， et al. Attention is all you need［C］// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2017： 6000-6010.
[11]	SHI C， YANG S. EdaDet： open-vocabulary object detection using early dense alignment［C］// Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2023： 15678-15688.
[12]	ZHOU X， GIRDHAR R， JOULIN A， et al. Detecting twenty-thousand classes using image-level supervision［C］// Proceedings of the 2022 European Conference on Computer Vision， LNCS 13669. Cham： Springer， 2022： 350-368.
[13]	JIA C， YANG Y， XIA Y， et al. Scaling up visual and vision-language representation learning with noisy text supervision［C］// Proceedings of the 38th International Conference on Machine Learning. New York： JMLR.org， 2021： 4904-4916.
[14]	YAO L， HAN J， LIANG X， et al. DetCLIPv2： scalable open-vocabulary object detection pre-training via word-region alignment［C］// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2023： 23497-23506.
[15]	MINDERER M， GRITSENKO A， STONE A， et al. Simple open-vocabulary object detection［C］// Proceedings of the 2022 European Conference on Computer Vision， LNCS 13670. Cham： Springer， 2022： 728-755.
[16]	杭焕云，刘俊. 结合解耦与广义微调的少样本目标检测［J］. 计算机工程与应用， 2026， 62（4）： 293-301.
	HANG H Y， LIU J. Few-shot object detection combining decoupling and generalized fine-tuning［J］. Computer Engineering and Applications， 2026， 62（4）： 293-301.
[17]	HAN G， MA J， HUANG S， et al. Few-shot object detection with fully cross-Transformer［C］// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2022： 5321-5330.
[18]	HAN G， HUANG S， MA J， et al. Meta Faster R-CNN： towards accurate few-shot object detection with attentive feature alignment［C］// Proceedings of the 36th AAAI Conference on Artificial Intelligence. Palo Alto： AAAI Press， 2022： 780-789.
[19]	李鸿天，史鑫昊，潘卫国，等. 融合多尺度和注意力机制的小样本目标检测［J］. 计算机应用， 2024， 44（5）： 1437-1444.
	LI H T， SHI X H， PAN W G， et al. Few-shot object detection via fusing multi-scale and attention mechanism［J］. Journal of Computer Applications， 2024， 44（5）： 1437-1444.
[20]	ZHANG X， LIU Y， WANG Y， et al. Detect everything with few examples［C］// Proceedings of the 8th Conference on Robot Learning. New York： JMLR.org， 2025： 3986-4004.
[21]	WANG X， HUANG T E， DARRELL T， et al. Frustratingly simple few-shot object detection［C］// Proceedings of the 37th International Conference on Machine Learning. New York： JMLR.org， 2020： 9919-9928.
[22]	HAN G， HE Y， HUANG S， et al. Query adaptive few-shot object detection with heterogeneous graph convolutional networks［C］// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2021： 3243-3252.
[23]	ZHANG G， LUO Z， CUI K， et al. Meta-DETR： image-level few-shot detection with inter-class correlation exploitation［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2023， 45（11）： 12832-12843.
[24]	BULAT A， GUERRERO R， MARTINEZ B， et al. FS-DETR： few-shot detection transformer with prompting and without re-training［C］// Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2023： 11759-11768.
[25]	HAN G， LIM S N. Few-shot object detection with foundation models［C］// Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2024： 28608-28618.
[26]	WU J， LIU S， HUANG D， et al. Multi-scale positive sample refinement for few-shot object detection［C］// Proceedings of the 2020 European Conference on Computer Vision， LNCS 12361. Cham： Springer， 2020： 456-472.
[27]	SUN B， LI B， CAI S， et al. FSCE： few-shot object detection via contrastive proposal encoding［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 7348-7358.
[28]	WANG A， CHEN H， LIU L， et al. YOLOv10： real-time end-to-end object detection［C］// Proceedings of the 38th International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2024： 107984-108011.
[29]	TIAN Y， YE Q， DOERMANN D. YOLOv12： attention-centric real-time object detectors［EB/OL］. ［2024-04-11］..
[30]	ZHAO Y， LV W， XU S， et al. DETRs beat YOLOs on real-time object detection［C］// Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2024： 16965-16974.

算法	nAP_0.5
算法	1-shot	3-shot	5-shot	10-shot
TFA w/ fc^［21］	5.6	12.5	15.9	19.0
TFA w/ cos^［21］	5.7	12.0	15.2	19.0
MPSR^［26］	4.1	9.4	12.5	17.8
QA-FewDet^［22］	10.2	17.9	20.4	23.8
Meta-DETR^［23］	12.4	21.6	25.2	30.6
FS-DETR^［24］	13.5	18.8	20.6	21.6
FM-FSOD^［25］	7.9	21.8	30.3	38.6
本文算法	8.7	21.6	31.4	39.7

算法	nAP_0.5
算法	1-shot	3-shot	5-shot	10-shot
TFA w/ fc^［21］	5.6	12.5	15.9	19.0
TFA w/ cos^［21］	5.7	12.0	15.2	19.0
MPSR^［26］	4.1	9.4	12.5	17.8
QA-FewDet^［22］	10.2	17.9	20.4	23.8
Meta-DETR^［23］	12.4	21.6	25.2	30.6
FS-DETR^［24］	13.5	18.8	20.6	21.6
FM-FSOD^［25］	7.9	21.8	30.3	38.6
本文算法	8.7	21.6	31.4	39.7

算法	新分割1					新分割2					新分割3
算法	1-shot	2-shot	3-shot	5-shot	10-shot	1-shot	2-shot	3-shot	5-shot	10-shot	1-shot	2-shot	3-shot	5-shot	10-shot
TFA w/ fc^［21］	36.8	29.1	43.6	55.7	57.0	18.2	29.0	33.4	35.5	39.0	27.7	33.6	42.5	48.7	50.2
TFA w/ cos^［21］	39.8	36.1	44.7	55.7	56.0	23.5	26.9	34.1	35.1	39.1	30.8	34.8	42.8	49.5	49.8
MPSR^［26］	41.7	42.5	51.4	55.2	61.8	24.4	29.3	39.2	39.9	47.8	35.6	41.8	42.3	48.0	49.7
FSCE^［27］	44.2	43.8	51.4	61.9	63.4	27.3	29.5	43.5	44.2	50.2	37.2	41.9	47.5	54.6	58.5
QA-FewDet^［22］	42.4	51.9	55.7	62.6	63.4	25.9	37.8	46.6	48.9	51.1	35.2	42.9	47.8	54.8	53.5
FCT^［17］	49.9	57.1	57.9	63.2	67.1	27.6	34.5	43.7	49.2	51.2	39.5	54.7	52.3	57.0	58.7
FS-DETR^［24］	45.0	48.5	51.5	52.7	56.1	37.3	41.3	43.4	46.6	49.0	43.8	47.1	50.6	52.1	56.9
Meta-DETR^［23］	40.6	51.4	58.0	59.2	63.6	37.0	36.6	43.7	49.1	54.6	41.6	45.9	52.7	58.9	60.6
FM-FSOD^［25］	40.1	53.5	57.0	68.6	72.0	33.1	36.3	48.8	54.8	64.7	39.2	50.2	55.7	63.4	68.1
本文算法	44.5	51.3	58.6	69.1	73.4	32.6	38.9	49.6	56.2	66.4	42.8	53.6	56.1	65.2	69.8

算法	新分割1					新分割2					新分割3
算法	1-shot	2-shot	3-shot	5-shot	10-shot	1-shot	2-shot	3-shot	5-shot	10-shot	1-shot	2-shot	3-shot	5-shot	10-shot
TFA w/ fc^［21］	36.8	29.1	43.6	55.7	57.0	18.2	29.0	33.4	35.5	39.0	27.7	33.6	42.5	48.7	50.2
TFA w/ cos^［21］	39.8	36.1	44.7	55.7	56.0	23.5	26.9	34.1	35.1	39.1	30.8	34.8	42.8	49.5	49.8
MPSR^［26］	41.7	42.5	51.4	55.2	61.8	24.4	29.3	39.2	39.9	47.8	35.6	41.8	42.3	48.0	49.7
FSCE^［27］	44.2	43.8	51.4	61.9	63.4	27.3	29.5	43.5	44.2	50.2	37.2	41.9	47.5	54.6	58.5
QA-FewDet^［22］	42.4	51.9	55.7	62.6	63.4	25.9	37.8	46.6	48.9	51.1	35.2	42.9	47.8	54.8	53.5
FCT^［17］	49.9	57.1	57.9	63.2	67.1	27.6	34.5	43.7	49.2	51.2	39.5	54.7	52.3	57.0	58.7
FS-DETR^［24］	45.0	48.5	51.5	52.7	56.1	37.3	41.3	43.4	46.6	49.0	43.8	47.1	50.6	52.1	56.9
Meta-DETR^［23］	40.6	51.4	58.0	59.2	63.6	37.0	36.6	43.7	49.1	54.6	41.6	45.9	52.7	58.9	60.6
FM-FSOD^［25］	40.1	53.5	57.0	68.6	72.0	33.1	36.3	48.8	54.8	64.7	39.2	50.2	55.7	63.4	68.1
本文算法	44.5	51.3	58.6	69.1	73.4	32.6	38.9	49.6	56.2	66.4	42.8	53.6	56.1	65.2	69.8

算法	Precision	Recall	F1	mAP_0.5
YOLOv10m^［28］	79.14	69.95	74.27	75.23
YOLOv10n^［28］	73.92	62.70	67.78	68.54
YOLO11m^［1］	77.65	69.72	73.47	75.62
YOLO11n^［1］	75.22	63.31	68.74	69.20
YOLOv12m^［29］	79.36	71.38	75.17	76.02
YOLOv12n^［29］	74.78	64.09	69.01	70.24
RT-DETR-l^［30］	79.89	73.40	76.49	77.73
YOLO-World-M^［4］	80.40	73.56	76.83	77.98
本文算法	80.86	73.95	77.26	78.42