Journal of Computer Applications


Multi-band Image Captioning Method Based on Concept-guided Feature Fusion


  • Received: 2025-06-06  Revised: 2025-07-14  Accepted: 2025-08-08  Online: 2025-08-15  Published: 2025-08-15

MING Wenchao, LIN Suzhen, JIN Zanxia

  1. North University of China
  • Corresponding author: LIN Suzhen
  • Supported by:
    Basic Research Program of Shanxi Province; National Natural Science Foundation of China

Abstract: Existing image captioning models often struggle with multi-band images of complex scenes: because the features of the different bands differ significantly in space, directly applying a simple cross-attention mechanism makes it difficult to align and fuse them effectively. Moreover, the differing imaging principles of the bands and the complexity of the scene make it hard for models to capture key visual semantic information, so generated captions may omit crucial objects or remain incomplete. To address these issues, a multi-band image captioning model based on scene-concept-guided feature fusion is proposed. First, a pre-trained Faster R-CNN extracts regional features from infrared and visible images, and a scene-concept-guided multi-band Feature Alignment and Fusion module (FAF) is constructed. Second, to strengthen the model's ability to model visual semantic information, a Concept-Guided Module (CGB) is designed to retrieve scene concepts for the image and encode them. Finally, an adaptive gating mechanism (AGA) is built on this basis: at each decoding step, the model dynamically adjusts the weights of the fused multi-band image features and the concept features according to the current context. Experimental results on a visible-infrared image captioning dataset show that the proposed model achieves BLEU-4 and CIDEr scores of 56.1% and 119.5%, respectively, outperforming the second-best method by 1.1 and 2.9 percentage points, indicating that the proposed method can effectively improve the accuracy of multi-band image captioning.
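The adaptive gating step described in the abstract can be sketched as follows. This is a minimal illustration only: the gate parameters `W` and `b`, the feature dimension, and the exact gating form (a sigmoid over the concatenated features producing a per-dimension convex combination) are assumptions for exposition, not the paper's actual formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_gate_fuse(v_fused, c_concept, W, b):
    """Blend the fused multi-band visual feature with the concept feature.

    gate = sigmoid(W @ [v; c] + b) yields per-dimension weights in (0, 1);
    the decoder input at this time step is gate * v + (1 - gate) * c.
    """
    joint = np.concatenate([v_fused, c_concept])   # shape [2d]
    gate = sigmoid(W @ joint + b)                  # shape [d], values in (0, 1)
    return gate * v_fused + (1.0 - gate) * c_concept

d = 8
v = rng.normal(size=d)              # fused visible/infrared region feature (FAF output)
c = rng.normal(size=d)              # encoded scene-concept feature (CGB output)
W = rng.normal(size=(d, 2 * d)) * 0.1  # hypothetical learned gate weights
b = np.zeros(d)                        # hypothetical learned gate bias

fused = adaptive_gate_fuse(v, c, W, b)
print(fused.shape)  # (8,)
```

Because the gate lies strictly in (0, 1), the output is an elementwise convex combination of the two inputs, which is what lets the decoder lean on visual evidence or concept evidence as conditions change from step to step.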

Key words: image captioning, multi-band image, concept-guided, feature fusion, adaptive gating

