Multi-band image captioning method based on scene concept-guided feature fusion

Wenchao MING, Suzhen LIN, Zanxia JIN

Journal of Computer Applications 2026, 46 (5): 1560-1567. DOI: 10.11772/j.issn.1001-9081.2025050631

When processing multi-band images of complex scenes, existing image captioning models cannot effectively align and fuse features with a simple cross-attention mechanism, because the features of the different bands differ significantly in spatial layout. In addition, differences in the imaging principles of the bands and the complexity of the scenes make it difficult for models to capture key visual semantic information, so generated captions miss key targets and are incomplete. To address these issues, a multi-band image captioning method based on scene concept-guided feature fusion was proposed. Firstly, regional features of infrared and visible images were extracted with a pre-trained Faster Region-based Convolutional Neural Network (Faster R-CNN) feature extractor, and a scene concept-guided multi-band Feature Alignment and Fusion Module (FAFM) was constructed. Secondly, to strengthen the model's ability to represent visual semantic information, a Concept-Guided Module (CGM) was designed to retrieve and encode scene concepts for the images. Finally, an Adaptive Gating Mechanism (AGM) was built on this foundation: at each time step at which the decoder generated a word, the model dynamically adjusted the weights of the fused multi-band features and the concept features, thereby achieving feature fusion. Experimental results on visible-infrared image captioning datasets show that the proposed method achieves 56.7% in BLEU4 (BiLingual Evaluation Understudy with 4-grams) and 119.5% in CIDEr (Consensus-based Image Description Evaluation), which are 1.1 and 2.9 percentage points higher, respectively, than those of the second-best method. The proposed method effectively improves the accuracy of multi-band image captioning.
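
The abstract does not give the AGM's equations, so the following is only a minimal sketch of one common form of adaptive gating, not the authors' implementation. It assumes a sigmoid gate computed from the concatenation of the decoder hidden state, the fused multi-band feature, and the concept feature, followed by a per-dimension convex blend of the two features. The names `adaptive_gate`, `hidden`, `fused`, `concept`, `weights`, and `bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_gate(hidden, fused, concept, weights, bias):
    """Blend fused and concept features with a per-dimension gate.

    Assumed form (not taken from the paper):
        g   = sigmoid(W [h; f; c] + b)
        out = g * f + (1 - g) * c

    hidden, fused, concept: lists of d floats for one decoding step.
    weights: d rows of 3*d floats; bias: d floats.
    Returns (out, gates), both lists of d floats.
    """
    joint = hidden + fused + concept  # list concatenation [h; f; c]
    out, gates = [], []
    for row, b, f, c in zip(weights, bias, fused, concept):
        # One gate value per feature dimension.
        g = sigmoid(sum(w * x for w, x in zip(row, joint)) + b)
        gates.append(g)
        # g near 1 favors the fused feature, g near 0 the concept feature.
        out.append(g * f + (1.0 - g) * c)
    return out, gates


# Toy usage: zero weights and bias give a gate of 0.5 in every dimension,
# so the output is the plain average of the two feature vectors.
h = [0.0, 0.0]
f = [1.0, 3.0]
c = [3.0, 1.0]
W = [[0.0] * 6, [0.0] * 6]
blended, gates = adaptive_gate(h, f, c, W, [0.0, 0.0])
```

In a trained model the gate parameters would be learned, and the gate would be recomputed at every decoding step from the current hidden state, which is what would let the weighting change from word to word.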