Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (4): 1275-1282. DOI: 10.11772/j.issn.1001-9081.2025050589

• Multimedia Computing and Computer Simulation •


Object detection algorithm with few-shot learning based on YOLO-World

Shuai HE, Chunhua DENG()   

  1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, Hubei 430065, China
  • Received: 2025-05-29  Revised: 2025-08-26  Accepted: 2025-09-09  Online: 2025-09-15  Published: 2026-04-10
  • Contact: Chunhua DENG
  • About author: HE Shuai, born in 1997 in Heze, Shandong, M.S. candidate. His research interests include computer vision and few-shot object detection.
  • Supported by:
    Key Research and Development Program of Hubei Province (2023BAB071)


Abstract:

Object detection has been widely applied in computer vision, but most existing methods rely heavily on large-scale labeled data, which makes it difficult to handle the scarcity of samples for new categories in real-world conditions. Although current Open-Vocabulary object Detection (OVD) methods possess a certain cross-category generalization ability, they commonly suffer from coarse semantic matching and insufficient spatial localization accuracy when facing new categories with similar structures. To address these issues, an object detection algorithm with few-shot learning based on YOLO-World was proposed. Firstly, a Category-aware Convolution Kernel Construction Module (CCKCM) was proposed to fuse textual semantic embeddings with image features, thereby enhancing the model's semantic perception of new categories under few-shot settings. Secondly, an efficient object matching and localization mechanism combining sliding convolution with geometric spatial constraints was designed, achieving fast matching and accurate localization of target regions while keeping computational complexity low. Finally, an image dataset for Few-Shot Object Detection (FSOD) tasks was constructed, covering multiple typical scenes and object categories. Experimental results show that on the PASCAL VOC 2007+2012 dataset, the proposed algorithm reaches a 10-shot average precision of 73.4% on novel classes, which is 1.4 percentage points higher than that of FM-FSOD. These results indicate that the proposed algorithm provides a feasible technical path for rapid recognition of new-category objects in real-world scenarios.
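The paper does not give the CCKCM at code level in this abstract; as a rough illustration of the general idea it describes (a class text embedding acting as a dynamic 1×1 convolution kernel over image features, as in YOLO-World-style text-to-kernel fusion), a toy NumPy sketch follows. All names, shapes, and the cosine normalization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def category_aware_kernel(text_emb, img_feat):
    """Toy sketch: treat a class text embedding as a dynamic 1x1
    convolution kernel over an image feature map.
    text_emb: (C,)       class semantic embedding (hypothetical)
    img_feat: (C, H, W)  backbone feature map (hypothetical)
    Returns an (H, W) response map: large where the per-pixel
    feature aligns with the class semantics."""
    # Normalize both so the response becomes a cosine-similarity map
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    f = img_feat / (np.linalg.norm(img_feat, axis=0, keepdims=True) + 1e-8)
    # A 1x1 "convolution" is a per-pixel dot product with the kernel
    return np.einsum('c,chw->hw', t, f)

C, H, W = 8, 4, 4
rng = np.random.default_rng(0)
emb = rng.normal(size=C)            # stand-in text embedding
feat = rng.normal(size=(C, H, W))   # stand-in image features
resp = category_aware_kernel(emb, feat)
print(resp.shape)  # (4, 4)
```

Because both inputs are normalized, every response value lies in [-1, 1], which makes the map directly usable as a class-presence score per spatial location.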

Key words: YOLO-World, Open-Vocabulary object Detection (OVD), few-shot learning, feature fusion, sliding convolution
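The sliding-convolution matching mentioned in the abstract can be pictured as correlating a class-specific kernel across the query feature map and taking response peaks as candidate locations. The minimal sketch below uses normalized cross-correlation on 2-D maps; the function name, shapes, and the NCC scoring are assumptions for illustration, not the paper's mechanism, and the geometric spatial constraints are omitted.

```python
import numpy as np

def sliding_match(query, template):
    """Slide `template` over `query` and return a normalized
    cross-correlation score map of shape (H-h+1, W-w+1).
    query: (H, W) response map; template: (h, w) class kernel."""
    H, W = query.shape
    h, w = template.shape
    t = template - template.mean()
    tn = np.linalg.norm(t) + 1e-8
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            patch = query[i:i + h, j:j + w]
            p = patch - patch.mean()
            # NCC in [-1, 1]; 1 means an exact (affine) match
            out[i, j] = (p * t).sum() / ((np.linalg.norm(p) + 1e-8) * tn)
    return out

# Plant the template at a known offset and recover it from the peak
q = np.zeros((10, 10))
tpl = np.arange(9, dtype=float).reshape(3, 3)
q[3:6, 4:7] = tpl
score = sliding_match(q, tpl)
i, j = np.unravel_index(np.argmax(score), score.shape)
print(int(i), int(j))  # 3 4
```

In practice such a sweep would run as a single convolution on feature maps, with the abstract's geometric constraints then pruning peaks whose implied boxes have implausible scale or aspect ratio.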

CLC number: