Multi-band image captioning method based on scene concept-guided feature fusion
Wenchao MING, Suzhen LIN, Zanxia JIN
Journal of Computer Applications    2026, 46 (5): 1560-1567.   DOI: 10.11772/j.issn.1001-9081.2025050631

When processing multi-band images of complex scenes, existing image captioning models cannot effectively align and fuse features with a simple cross-attention mechanism, because the features of the different bands differ significantly in their spatial distribution. In addition, differences in the imaging principles of the bands and the complexity of the scenes make it difficult for models to capture key visual semantic information, so generated captions miss key targets and are incomplete. To address these issues, a multi-band image captioning method based on scene concept-guided feature fusion was proposed. Firstly, the regional features of infrared and visible images were extracted using a pre-trained Faster Region-based Convolutional Neural Network (Faster R-CNN) feature extractor, and a scene concept-guided multi-band Feature Alignment and Fusion Module (FAFM) was constructed. Secondly, to enhance the model's ability to represent visual semantic information, a Concept-Guided Module (CGM) was designed to retrieve and encode scene concepts for the images. Finally, an Adaptive Gating Mechanism (AGM) was built on this foundation: at each time step at which the decoder generated a word, the model dynamically adjusted the weights of the fused multi-band features and the concept features, thereby achieving feature fusion. Experimental results on visible-infrared image captioning datasets show that the proposed method achieves 56.7% in BLEU4 (BiLingual Evaluation Understudy with 4-grams) and 119.5% in CIDEr (Consensus-based Image Description Evaluation), which are 1.1 and 2.9 percentage points higher, respectively, than those of the second-best method. The proposed method effectively improves the accuracy of multi-band image captioning.
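
The abstract does not give the AGM equations, so the following is only a minimal sketch of one common form of adaptive gating, not the authors' implementation. It assumes a sigmoid gate computed from the decoder hidden state, the fused multi-band feature, and the concept feature, and a convex combination of the two features. The names `adaptive_gate`, `weights`, and `bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_gate(hidden, fused, concept, weights, bias):
    """Blend fused multi-band features with concept features (assumed form).

    hidden, fused, concept: lists of floats, each of length d.
    weights: d rows, each of length 3 * d (hypothetical gate parameters).
    bias: list of d floats.
    Returns (mixed, gates), where mixed[i] = g * fused[i] + (1 - g) * concept[i].
    """
    # Concatenate the three inputs that drive the gate at this time step.
    gate_input = hidden + fused + concept
    # One gate value per feature dimension.
    gates = [
        sigmoid(sum(w * x for w, x in zip(row, gate_input)) + b)
        for row, b in zip(weights, bias)
    ]
    # Convex combination: a gate near 1 favours the fused visual feature,
    # a gate near 0 favours the concept feature.
    mixed = [g * f + (1.0 - g) * c for g, f, c in zip(gates, fused, concept)]
    return mixed, gates


# With all-zero gate parameters every gate is 0.5, so the output is the
# plain average of the fused and concept features.
d = 2
zero_weights = [[0.0] * (3 * d) for _ in range(d)]
mixed, gates = adaptive_gate([0.0, 0.0], [1.0, 2.0], [3.0, 4.0], zero_weights, [0.0, 0.0])
print(gates, mixed)
```

In a trained model the gate parameters would be learned, so the balance between visual and concept information could change from one generated word to the next.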
