Multi-band Image Captioning Method Based on Concept-guided Feature Fusion

doi:10.11772/j.issn.1001-9081.2025050631

Abstract

Abstract: Existing image captioning models struggle to align and fuse multi-band images in complex scenes: the features of different bands differ significantly in space, so a simple cross-attention mechanism applied directly cannot align and fuse them effectively. In addition, the differing imaging principles of the bands and the complexity of the scenes make it hard for models to capture key visual semantic information, so generated captions omit crucial objects or are incomplete. To address these issues, we propose a multi-band image captioning model based on scene-concept-guided feature fusion. First, we use a pre-trained Faster R-CNN to extract regional features from infrared and visible images and construct a multi-band feature alignment and fusion module (FAF) guided by scene concepts. Next, to strengthen the modeling of visual semantic information, we design a Concept-Guided Module (CGB) that retrieves scene concepts for each image and encodes them. Finally, on this basis we build an adaptive gating mechanism (AGA): at each time step in which the decoder generates a word, the model dynamically adjusts the weights of the fused multi-band image features and the concept features. Experimental results on a visible-infrared image captioning dataset show that the proposed model achieves BLEU4 and CIDEr scores of 56.1% and 119.5%, respectively, outperforming the second-best method by 1.1 and 2.9 points, which indicates that the proposed method effectively improves the accuracy of multi-band image captioning.

Key words: image captioning, multi-band image, concept-guided, feature fusion, adaptive gating
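
The abstract describes the adaptive gating only at a high level, so the following is a minimal sketch of one common form of gated fusion, not the authors' AGA implementation. It assumes a sigmoid gate computed from the decoder hidden state, the fused image feature, and the concept feature; the function name `gated_fusion` and all parameters are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    hidden: List[float],
    visual: List[float],
    concept: List[float],
    weights: List[List[float]],
    bias: List[float],
) -> List[float]:
    """Blend visual and concept features with an element-wise gate.

    Assumed form (not taken from the paper):
        gate_i  = sigmoid(weights_i . [hidden; visual; concept] + bias_i)
        fused_i = gate_i * visual_i + (1 - gate_i) * concept_i
    `visual`, `concept`, `bias`, and `weights` must have the same length d,
    and each row of `weights` must have len(hidden) + 2 * d entries.
    """
    joint = hidden + visual + concept  # concatenate the gate inputs
    fused = []
    for w_row, b, v, c in zip(weights, bias, visual, concept):
        gate = sigmoid(sum(w * x for w, x in zip(w_row, joint)) + b)
        fused.append(gate * v + (1.0 - gate) * c)
    return fused
```

In a decoder, a function of this kind would be called once per time step with the current hidden state, so the balance between image and concept features can change from word to word.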


CLC Number: TP391.41

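Wenchao MING (明文超), Suzhen LIN (蔺素珍), Zanxia JIN (晋赞霞). Multi-band image captioning method based on concept-guided feature fusion [J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2025050631.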

