Journal of Computer Applications, 2026, Vol. 46, Issue (5): 1560-1567. DOI: 10.11772/j.issn.1001-9081.2025050631

• Multimedia computing and computer simulation •

Multi-band image captioning method based on scene concept-guided feature fusion

Wenchao MING, Suzhen LIN, Zanxia JIN

  1. School of Computer Science and Technology, North University of China, Taiyuan Shanxi 030051, China
  • Received: 2025-06-06 Revised: 2025-07-14 Accepted: 2025-08-08 Online: 2025-08-15 Published: 2026-05-10
  • Contact: Suzhen LIN
  • About the authors: MING Wenchao, born in 1999, M. S. candidate. His research interests include image captioning.
    JIN Zanxia, born in 1991, Ph. D., lecturer. Her research interests include multimodal machine learning and intelligent question-answering systems.
  • Supported by:
    National Natural Science Foundation of China (62406296); Natural Science Foundation of Shanxi Province (202303021211147)

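Chinese title: 基于场景概念引导特征融合的多波段图像描述生成方法

明文超 (Wenchao MING), 蔺素珍 (Suzhen LIN), 晋赞霞 (Zanxia JIN)

  1. School of Computer Science and Technology, North University of China (中北大学), Taiyuan 030051
  • Corresponding author: 蔺素珍 (Suzhen LIN)
  • About the authors: 明文超 (MING Wenchao, born 1999), male, from Jinan, Shandong, M. S. candidate, CCF member. Main research interest: image captioning.
    晋赞霞 (JIN Zanxia, born 1991), female, from Yuncheng, Shanxi, lecturer, Ph. D. Main research interests: multimodal machine learning and intelligent question-answering systems.
  • Funding:
    National Natural Science Foundation of China (62406296); Natural Science Foundation of Shanxi Province (202303021211147)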

Abstract:

When processing multi-band images of complex scenes, existing image captioning models cannot effectively align and fuse features with a simple cross-attention mechanism, because the features of multi-band images differ significantly in space. In addition, differences in the imaging principles of the bands and the complexity of the scenes make it difficult for models to capture key visual semantic information, so the generated captions miss key targets or are incomplete. To address these issues, a multi-band image captioning method based on scene concept-guided feature fusion was proposed. Firstly, the regional features of infrared and visible images were extracted using a pre-trained feature extractor, Faster Region-based Convolutional Neural Network (Faster R-CNN), and a scene concept-guided multi-band Feature Alignment and Fusion Module (FAFM) was constructed. Secondly, to enhance the model's ability to model visual semantic information, a Concept-Guided Module (CGM) was designed to retrieve and encode scene concepts for the images. Finally, an Adaptive Gating Mechanism (AGM) was built on this foundation: at each time step at which the decoder generated a word, the model dynamically adjusted the weights of the fused multi-band features and the concept features, thereby achieving feature fusion. Experimental results on the visible-infrared image captioning datasets show that the proposed method achieves 56.7% in BLEU-4 (BiLingual Evaluation Understudy with 4-grams) and 119.5% in CIDEr (Consensus-based Image Description Evaluation), which are 1.1 and 2.9 percentage points higher, respectively, than those of the second-best method. The proposed method effectively improves the accuracy of multi-band image captioning.

Key words: image captioning, multi-band image, concept guidance, feature fusion, adaptive gating
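
The abstract does not give the gating formula, so the following is only a minimal sketch of how an adaptive gate of this kind is commonly written, not the authors' AGM. It assumes a per-dimension sigmoid gate computed from the decoder hidden state, the fused multi-band feature, and the concept feature; the names `adaptive_gate`, `w_gate`, and `b_gate` are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_gate(
    hidden: Sequence[float],
    fused: Sequence[float],
    concept: Sequence[float],
    w_gate: Sequence[Sequence[float]],
    b_gate: Sequence[float],
) -> List[float]:
    """Blend fused and concept features at one decoding step.

    Assumed form (not taken from the paper):
        g_i   = sigmoid(w_gate[i] . [hidden; fused; concept] + b_gate[i])
        out_i = g_i * fused_i + (1 - g_i) * concept_i
    """
    if len(fused) != len(concept):
        raise ValueError("fused and concept features must have the same size")
    # Concatenate everything the gate is allowed to look at.
    joint = list(hidden) + list(fused) + list(concept)
    out = []
    for i, (f, c) in enumerate(zip(fused, concept)):
        score = sum(w * x for w, x in zip(w_gate[i], joint)) + b_gate[i]
        g = sigmoid(score)
        out.append(g * f + (1.0 - g) * c)
    return out


# Toy inputs: with all-zero gate parameters every gate equals 0.5,
# so the output is the plain average of the two feature vectors.
hidden = [0.5, -0.2]
fused = [1.0, 0.0]
concept = [0.0, 1.0]
zero_w = [[0.0] * 6 for _ in range(2)]
print(adaptive_gate(hidden, fused, concept, zero_w, [0.0, 0.0]))  # [0.5, 0.5]
```

In a trained model the gate parameters would be learned, so the decoder could lean on the fused visual feature for some words and on the scene-concept feature for others; the toy values here only illustrate the blending arithmetic.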

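Abstract (from the Chinese edition):

When existing image captioning models process multi-band images of complex scenes, the features of the different bands differ significantly in space, so a simple cross-attention mechanism alone cannot align and fuse them effectively. Moreover, the different imaging principles of the bands and the complexity of the scenes make it hard for a model to capture key visual semantic information, so generated captions can miss key targets or be incomplete. To address these problems, a multi-band image captioning method based on scene concept-guided feature fusion is proposed. First, a pre-trained feature extractor, Faster R-CNN (Faster Region-based Convolutional Neural Network), extracts the regional features of the infrared and visible images, and a scene concept-guided multi-band Feature Alignment and Fusion Module (FAFM) is constructed. Second, to improve the model's ability to model visual semantic information, a Concept-Guided Module (CGM) is designed to retrieve scene concepts for the images and encode them. Finally, an Adaptive Gating Mechanism (AGM) is built: at each time step at which the decoder generates a word, the model dynamically adjusts the weights of the fused multi-band features and the concept features according to the situation, thereby fusing the features. Experimental results on visible-infrared image captioning datasets show that the proposed method reaches 56.7% in BLEU-4 (BiLingual Evaluation Understudy with 4-grams) and 119.5% in CIDEr (Consensus-based Image Description Evaluation), which are 1.1 and 2.9 percentage points higher, respectively, than the second-best method. The proposed method therefore effectively improves the accuracy of multi-band image captioning.

Key words (from the Chinese edition): image captioning, multi-band image, concept guidance, feature fusion, adaptive gating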

CLC Number: