Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (4): 1264-1274. DOI: 10.11772/j.issn.1001-9081.2025050543

• Multimedia Computing and Computer Simulation •


Progressive dual-stage modality interaction for single-domain generalized object detection

Yongbing ZHANG1, Lirong YAN1, Xiaofen TANG1,2

  1. School of Information Engineering, Ningxia University, Yinchuan, Ningxia 750021, China
    2. Ningxia Key Laboratory of Artificial Intelligence and Information Security for Channeling Computing Resources from the East to the West (Ningxia University), Yinchuan, Ningxia 750021, China
  • Received: 2025-05-19  Revised: 2025-07-25  Accepted: 2025-08-01  Online: 2025-08-08  Published: 2026-04-10
  • Contact: Xiaofen TANG
  • About author: ZHANG Yongbing, born in 1999 in Tianshui, Gansu, M. S. candidate, CCF member. His research interests include domain generalized object detection.
    YAN Lirong, born in 1999 in Zhongwei, Ningxia, M. S. candidate, CCF member. Her research interests include few-shot object detection.
  • Supported by:
    National Natural Science Foundation of China (61966029)


Abstract:

Existing vision-language-based single-domain generalization models rely on fixed, unidirectional text guidance for local visual alignment, which limits their ability to model local-global context. To address this problem, a Progressive Dual-stage Modality Interaction (PDMI) framework was proposed. In PDMI, global domain-invariant features were extracted hierarchically within each modality, and the complementary semantics of the visual and textual modalities were fully exploited across modalities, thereby capturing fine-grained semantic knowledge. Firstly, fixed domain-agnostic prompts were combined with learnable Adaptive Domain Prompts (ADP) to guide samples toward domain-specific semantic awareness. At the same time, on top of the ResNet-101 visual backbone, a Multi-level Intra-Modality Interaction (MIMI) module was designed, in which, guided by adaptive visual prompts, Intra-Modality Mamba Interaction (IMMI) was performed on source-domain images to extract global domain-invariant features and improve the distribution of visual representations. Then, a Cross-Modality Bidirectional Interaction and Fusion (CMBIF) mechanism was proposed to extract and align fine-grained cross-modality features, realizing fine-grained inter-modality interaction through bidirectional visual or textual guidance. Finally, a Cross-Modality Adaptive Fusion (CMAF) module was employed to search automatically for the optimal combination of inter-modality information, further reducing the redundant features produced by inter-modality interaction. Experimental results on three challenging domain-shift datasets, Diverse Weather, Virtual-to-Reality, and UAV-OD, show that PDMI improves the mean Precision on Target domain (mPT) over the C-Gap, SRCD (Semantic Reasoning with Compound Domains), and FDD (Frequency Domain Disentanglement) methods by an average of 2.0, 4.0, and 4.2 percentage points, respectively. These results indicate that PDMI extracts global-local domain-invariant features effectively and thus improves generalization to unseen target domains, which is essential in scenarios with significant distribution shift between the source and target domains and limited target-domain data.
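Since the abstract names but does not specify its modules, the following is a minimal PyTorch sketch of the general pattern it outlines: fixed domain-agnostic prompts concatenated with learnable adaptive prompts, bidirectional cross-modal attention, and a learned gate for adaptive fusion. Every class name, dimension, and design choice below is an assumption made for illustration; this is not the authors' PDMI implementation, and the actual CMBIF and CMAF modules may differ substantially.

```python
# Illustrative sketch only: hypothetical stand-ins for the prompt, interaction,
# and fusion components described in the abstract; not the authors' code.
import torch
import torch.nn as nn


class DomainPrompts(nn.Module):
    """Fixed domain-agnostic tokens concatenated with learnable adaptive tokens (ADP-like)."""
    def __init__(self, dim: int = 256, n_fixed: int = 8, n_adaptive: int = 8):
        super().__init__()
        # Stand-in for frozen text-encoder embeddings of a domain-agnostic prompt.
        self.register_buffer("fixed", torch.randn(n_fixed, dim))
        # Learnable adaptive domain-prompt tokens, updated during training.
        self.adaptive = nn.Parameter(torch.randn(n_adaptive, dim))

    def forward(self, batch_size: int) -> torch.Tensor:
        tokens = torch.cat([self.fixed, self.adaptive], dim=0)  # (Nt, C)
        return tokens.unsqueeze(0).expand(batch_size, -1, -1)   # (B, Nt, C)


class BidirectionalInteraction(nn.Module):
    """Cross-attention in both directions: text-guided vision and vision-guided text."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.v_from_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        # vis: (B, Nv, C) visual tokens; txt: (B, Nt, C) prompt/text tokens.
        vis_upd, _ = self.v_from_t(vis, txt, txt)  # visual queries attend to text
        txt_upd, _ = self.t_from_v(txt, vis, vis)  # text queries attend to vision
        return self.norm_v(vis + vis_upd), self.norm_t(txt + txt_upd)


class AdaptiveFusion(nn.Module):
    """Learned per-channel gate that mixes the two streams and damps redundancy."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # Pool the text tokens into one context vector and broadcast over visual tokens.
        txt_ctx = txt.mean(dim=1, keepdim=True).expand_as(vis)
        g = self.gate(torch.cat([vis, txt_ctx], dim=-1))  # gate in [0, 1], shape (B, Nv, C)
        return g * vis + (1.0 - g) * txt_ctx


if __name__ == "__main__":
    vis = torch.randn(2, 196, 256)       # e.g. a flattened backbone feature map
    txt = DomainPrompts()(batch_size=2)  # (2, 16, 256) prompt tokens
    v, t = BidirectionalInteraction()(vis, txt)
    print(AdaptiveFusion()(v, t).shape)  # torch.Size([2, 196, 256])
```

The per-channel sigmoid gate is one plausible reading of "automatically searching for the optimal combination of inter-modality information": the network learns, per position and channel, how much of each stream to keep rather than using a fixed mixing weight.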

Key words: single-domain generalized object detection, Vision-Language Model (VLM), prompt learning, multimodal fusion

CLC number: