Journal of Computer Applications


Unsupervised domain adaptation for image classification via modal correlation feature alignment

  

  • Received:2025-10-14 Revised:2025-12-30 Accepted:2026-01-06 Online:2026-01-14 Published:2026-01-14
  • Supported by:
    National Natural Science Foundation of China;Hubei Provincial Natural Science Foundation of China;International Science and Technology Cooperation Program of Hubei Province;Science and Technology Research Project of Hubei Provincial Department of Education;The Key Laboratory for Crop Production and Smart Agriculture of Yunnan Province


QUAN Shiyu1,2, LI Yanan1*, XIAO Zhenxing1, LU Hao3, FANG Zhiwen4, QU Junfeng5

  1. School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China; 2. Key Laboratory for Crop Production and Smart Agriculture of Yunnan Province (Yunnan Agricultural University), Kunming 650201, China; 3. School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China; 4. School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China; 5. School of Computer Engineering, Hubei University of Arts and Science, Xiangyang 441053, China


  • Corresponding author: LI Yanan

Abstract: Unsupervised domain adaptation methods aim to reduce the distribution discrepancies across multiple domains with varying styles and backgrounds. Recently, large Vision-Language Models (VLMs), such as CLIP (Contrastive Language-Image Pre-training), have exhibited remarkable zero-shot learning capabilities in unsupervised domain adaptation tasks. However, most existing transfer learning approaches based on VLMs focus primarily on either the language modality or the visual modality alone, neglecting the subtle interaction between the two. To address this limitation, an unsupervised domain adaptation image classification method based on modality correlation feature alignment was proposed, with the CLIP model as its foundational architecture. First, image and text features were extracted using the backbone network; during the encoding of visual-branch features, a learnable token sequence was introduced to capture information that might otherwise be discarded. Subsequently, a modality separation network was designed to extract semantically relevant cues from the visual features and integrate them into the textual representation, so that the enhanced text prompts guided the extraction of visual features with greater domain invariance. Finally, these prompted visual features were aligned with cross-modal correlation features through a modality discriminator trained on the source domain, enabling effective knowledge transfer from the source domain to the target domain.
Experimental results demonstrate that the proposed method achieves average accuracies of 77.1% (ResNet-50) / 86.2% (ViT-B/16) on the Office-Home dataset and 87.5% (ResNet-101) / 90.2% (ViT-B/16) on the VisDA-2017 dataset. Compared with DAPrompt (Domain Adaptation via Prompt learning), the classification accuracy increases by 2.6% (ResNet-50) and 0.7% (ResNet-101, ViT-B/16), respectively; compared with PDA (Prompt-based Distribution Alignment for unsupervised domain adaptation), it increases by 0.5% (ViT-B/16).
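The pipeline described in the abstract — CLIP-style features, a learnable token sequence on the visual branch, a modality separation network that injects a visual cue into the text prompts, and cosine-similarity classification — can be sketched as below. This is a minimal toy illustration under assumed shapes and with random stand-in features; all module names (`l2norm`, `W_sep`, etc.) are hypothetical and do not come from the authors' implementation, and the source-domain modality discriminator is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # assumed shared embedding dimension
n_cls = 3      # assumed number of classes

def l2norm(x):
    # CLIP normalizes features to the unit sphere before matching
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# 1) Backbone image feature, plus a learnable token sequence meant to
#    capture information that would otherwise be discarded (random here)
img_feat = l2norm(rng.normal(size=d))
learn_tokens = rng.normal(size=(4, d)) * 0.01
img_feat = l2norm(img_feat + learn_tokens.mean(axis=0))

# 2) Modality separation network (stand-in: a single linear map) extracts
#    a semantic cue from the visual feature and enhances the text prompts
W_sep = rng.normal(size=(d, d)) * 0.1
cue = img_feat @ W_sep
text_feats = l2norm(rng.normal(size=(n_cls, d)))   # per-class prompt embeddings
text_feats = l2norm(text_feats + cue)              # prompts enhanced by the visual cue

# 3) CLIP-style classification: cosine similarity across modalities
logits = text_feats @ img_feat
pred = int(np.argmax(logits))
print(pred)
```

In the actual method, the learnable tokens, separation network, and text prompts would be optimized jointly, with a modality discriminator trained on labeled source-domain data enforcing the cross-modal correlation alignment.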

Key words: unsupervised domain adaptation, Vision-Language Model (VLM), CLIP (Contrastive Language-Image Pre-training), feature alignment, image classification

