Journal of Computer Applications


Unsupervised domain adaptation for image classification via modal correlation feature alignment

  

  • Received:2025-10-14 Revised:2025-12-30 Accepted:2026-01-06 Online:2026-01-14 Published:2026-01-14
  • Supported by:
    National Natural Science Foundation of China;Hubei Provincial Natural Science Foundation of China;International Science and Technology Cooperation Program of Hubei Province;Science and Technology Research Project of Hubei Provincial Department of Education;The Key Laboratory for Crop Production and Smart Agriculture of Yunnan Province


QUAN Shiyu1,2, LI Yanan1*, XIAO Zhenxing1, LU Hao3, FANG Zhiwen4, QU Junfeng5

  1. School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China; 2. Key Laboratory for Crop Production and Smart Agriculture of Yunnan Province (Yunnan Agricultural University), Kunming 650201, China; 3. School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China; 4. School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China; 5. School of Computer Engineering, Hubei University of Arts and Science, Xiangyang 441053, China


  • Corresponding author: LI Yanan

Abstract: Unsupervised domain adaptation methods aim to reduce the distribution discrepancies across multiple domains with varying styles and backgrounds. Recently, large Vision-Language Models (VLMs), such as CLIP (Contrastive Language-Image Pre-training), have exhibited remarkable zero-shot learning capabilities in unsupervised domain adaptation tasks. However, most existing transfer learning approaches based on VLMs focus primarily on either the language modality or the visual modality alone, neglecting the subtle interaction between the two. To address this limitation, an unsupervised domain adaptation image classification method based on modality correlation feature alignment was proposed, with the CLIP model as its foundational architecture. First, image and text features were extracted using the backbone network; during the encoding of visual-branch features, a learnable token sequence was introduced to capture information that might otherwise be discarded. Subsequently, a modality separation network was designed to extract semantically relevant cues from the visual features and integrate them into the textual representation, so that the enhanced text prompts guided the extraction of visual features with greater domain invariance. Finally, these prompted visual features were aligned with cross-modal correlation features through a modality discriminator trained on the source domain, enabling effective knowledge transfer from the source domain to the target domain.
Experimental results demonstrate that the proposed method achieves average accuracies of 77.1% (ResNet-50) / 86.2% (ViT-B/16) on the Office-Home dataset and 87.5% (ResNet-101) / 90.2% (ViT-B/16) on the VisDA-2017 dataset. Compared with DAPrompt (Domain Adaptation via Prompt learning), the classification accuracy increases by 2.6% (ResNet-50) and 0.7% (ResNet-101, ViT-B/16), respectively; compared with PDA (Prompt-based Distribution Alignment for unsupervised domain adaptation), it increases by 0.5% (ViT-B/16).
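The pipeline described in the abstract — CLIP-style features, a learnable token sequence on the visual branch, a modality separation network that injects a visual cue into the text prompts, and cosine-similarity classification — can be sketched as below. This is a minimal toy illustration under assumed shapes and with random stand-in features; all module names (`l2norm`, `W_sep`, etc.) are hypothetical and do not come from the authors' implementation, and the source-domain modality discriminator is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # assumed shared embedding dimension
n_cls = 3      # assumed number of classes

def l2norm(x):
    # CLIP normalizes features to the unit sphere before matching
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# 1) Backbone image feature, plus a learnable token sequence meant to
#    capture information that would otherwise be discarded (random here)
img_feat = l2norm(rng.normal(size=d))
learn_tokens = rng.normal(size=(4, d)) * 0.01
img_feat = l2norm(img_feat + learn_tokens.mean(axis=0))

# 2) Modality separation network (stand-in: a single linear map) extracts
#    a semantic cue from the visual feature and enhances the text prompts
W_sep = rng.normal(size=(d, d)) * 0.1
cue = img_feat @ W_sep
text_feats = l2norm(rng.normal(size=(n_cls, d)))   # per-class prompt embeddings
text_feats = l2norm(text_feats + cue)              # prompts enhanced by the visual cue

# 3) CLIP-style classification: cosine similarity across modalities
logits = text_feats @ img_feat
pred = int(np.argmax(logits))
print(pred)
```

In the actual method, the learnable tokens, separation network, and text prompts would be optimized jointly, with a modality discriminator trained on labeled source-domain data enforcing the cross-modal correlation alignment.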

Key words: unsupervised domain adaptation, Vision-Language Model (VLM), CLIP (Contrastive Language-Image Pre-training), feature alignment, image classification

