Progressive dual-stage modality interaction for single-domain generalized object detection
Yongbing ZHANG, Lirong YAN, Xiaofen TANG
Journal of Computer Applications    2026, 46 (4): 1264-1274.   DOI: 10.11772/j.issn.1001-9081.2025050543
Abstract | DOI: 10.11772/j.issn.1001-9081.2025050543

Existing vision-language-based single-domain generalization models rely on fixed, unidirectional text guidance for local visual alignment, which limits their ability to model local-global context. To address this problem, a Progressive Dual-stage Modality Interaction (PDMI) framework was proposed. In PDMI, global domain-invariant features were extracted hierarchically within each modality, and the complementary semantic information between the visual and textual modalities was fully exploited, thereby capturing fine-grained semantic knowledge. Firstly, fixed domain-agnostic prompts were integrated with learnable Adaptive Domain Prompts (ADP) to guide the model toward domain-specific semantic awareness of samples. At the same time, a Multi-level Intra-Modality Interaction (MIMI) module was designed on top of the ResNet-101 visual backbone, in which Intra-Modality Mamba Interactions (IMMI) were performed on source-domain images under the guidance of adaptive visual prompts to extract global domain-invariant features, thereby improving the distribution of visual representations. Then, a Cross-Modality Bidirectional Interaction and Fusion (CMBIF) mechanism was adopted to extract and align fine-grained cross-modality features, realizing fine-grained inter-modality interaction through the bidirectional guidance of visual and textual prompts. Finally, a Cross-Modality Adaptive Fusion (CMAF) module was employed to search automatically for the optimal combination of inter-modality information, thereby reducing redundant features produced by cross-modality interaction. Experiments were conducted on three challenging domain-shift datasets: Diverse Weather, Virtual-to-Reality, and UAV-OD. The results show that PDMI improves the mean Precision on the Target domain (mPT) over the C-Gap, SRCD (Semantic Reasoning with Compound Domains), and FDD (Frequency Domain Disentanglement) methods by an average of 2.0, 4.0, and 4.2 percentage points, respectively.
These results indicate that PDMI extracts global-local domain-invariant features effectively and enhances generalization to unseen target domains significantly, which is essential for scenarios with substantial distribution shifts between the source and target domains as well as limited target-domain data.
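The adaptive-fusion idea behind a module such as CMAF can be illustrated with a minimal sketch: a learned gate scores the two modality features and fuses them as a weighted sum, so the combination is chosen automatically rather than fixed. This is an illustrative NumPy toy, not the paper's implementation; the function and variable names (`adaptive_fusion`, `w_gate`) are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fusion(visual, textual, w_gate):
    """Gate-weighted fusion of two modality features.

    visual, textual: (d,) feature vectors from the two modalities.
    w_gate: (2d, 2) learned projection producing per-modality weights.
    Returns a (d,) fused feature that is a convex combination of the inputs.
    """
    concat = np.concatenate([visual, textual])        # (2d,)
    weights = softmax(concat @ w_gate)                # (2,), sums to 1
    return weights[0] * visual + weights[1] * textual # (d,)

# Toy usage with random features and an untrained gate.
rng = np.random.default_rng(0)
d = 8
v = rng.normal(size=d)
t = rng.normal(size=d)
w = rng.normal(size=(2 * d, 2))
fused = adaptive_fusion(v, t, w)
```

In training, `w_gate` would be learned jointly with the rest of the network, letting the gate down-weight whichever modality contributes redundant information for a given sample.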
