Progressive dual-stage modality interaction for single-domain generalized object detection
Yongbing ZHANG, Lirong YAN, Xiaofen TANG
Journal of Computer Applications    2026, 46 (4): 1264-1274.   DOI: 10.11772/j.issn.1001-9081.2025050543
Abstract | DOI: 10.11772/j.issn.1001-9081.2025050543

Existing vision-language-based single-domain generalization models rely on fixed, unidirectional text guidance for local visual alignment, which limits their ability to model local-global context. To address this problem, a Progressive Dual-stage Modality Interaction (PDMI) framework was proposed. In PDMI, global domain-invariant features were extracted hierarchically within each modality, and the complementary semantic information between the visual and textual modalities was fully exploited, thereby capturing fine-grained semantic knowledge. Firstly, fixed domain-agnostic prompts were integrated with learnable Adaptive Domain Prompts (ADP) to guide the model toward domain-specific semantic awareness of samples. At the same time, a Multi-level Intra-Modality Interaction (MIMI) module was designed on top of the ResNet-101 visual backbone, in which Intra-Modality Mamba Interactions (IMMI) were performed on source-domain images under the guidance of adaptive visual prompts to extract global domain-invariant features, thereby improving the distribution of visual representations. Then, a Cross-Modality Bidirectional Interaction and Fusion (CMBIF) mechanism was adopted to extract and align fine-grained cross-modality features, realizing fine-grained inter-modality interaction through the bidirectional guidance of visual and textual prompts. Finally, a Cross-Modality Adaptive Fusion (CMAF) module was employed to search automatically for the optimal combination of inter-modality information, thereby reducing redundant features produced by cross-modality interaction. Experiments were conducted on three challenging domain-shift datasets: Diverse Weather, Virtual-to-Reality, and UAV-OD. The results show that PDMI improves the mean Precision on the Target domain (mPT) over the C-Gap, SRCD (Semantic Reasoning with Compound Domains), and FDD (Frequency Domain Disentanglement) methods by an average of 2.0, 4.0, and 4.2 percentage points, respectively.
These results indicate that PDMI extracts global-local domain-invariant features effectively and enhances generalization to unseen target domains significantly, which is essential for scenarios with substantial distribution shifts between the source and target domains as well as limited target-domain data.
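The adaptive-fusion idea behind a module such as CMAF can be illustrated with a minimal sketch: a learned gate scores the two modality features and fuses them as a weighted sum, so the combination is chosen automatically rather than fixed. This is an illustrative NumPy toy, not the paper's implementation; the function and variable names (`adaptive_fusion`, `w_gate`) are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fusion(visual, textual, w_gate):
    """Gate-weighted fusion of two modality features.

    visual, textual: (d,) feature vectors from the two modalities.
    w_gate: (2d, 2) learned projection producing per-modality weights.
    Returns a (d,) fused feature that is a convex combination of the inputs.
    """
    concat = np.concatenate([visual, textual])        # (2d,)
    weights = softmax(concat @ w_gate)                # (2,), sums to 1
    return weights[0] * visual + weights[1] * textual # (d,)

# Toy usage with random features and an untrained gate.
rng = np.random.default_rng(0)
d = 8
v = rng.normal(size=d)
t = rng.normal(size=d)
w = rng.normal(size=(2 * d, 2))
fused = adaptive_fusion(v, t, w)
```

In training, `w_gate` would be learned jointly with the rest of the network, letting the gate down-weight whichever modality contributes redundant information for a given sample.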
