Journal of Computer Applications


Image Caption Method Based on Swin-Transformer and Multi-Scale Feature Fusion

  

  • Received: 2024-10-22; Revised: 2024-12-03; Accepted: 2024-12-09; Online: 2024-12-17; Published: 2024-12-17


WANG Ziyi1, LI Weijun1,2, LIU Xueyang1, DING Jianping1, LIU Shixia1, SU Yilei1

  1. School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China

    2. Key Laboratory of Intelligent Processing of Images and Graphics of the National Ethnic Affairs Commission (North Minzu University), Yinchuan 750021, China

  • Corresponding author: LI Weijun
  • Supported by:
    National Natural Science Foundation of China; Natural Science Foundation of Ningxia; Fundamental Research Funds for the Central Universities

Abstract: Transformer-based image caption methods compute multi-head attention weights over the entire input sequence and therefore lack hierarchical feature extraction capability; in addition, two-stage image caption methods limit model performance. To address these issues, an image caption method based on Swin Transformer and multi-scale feature fusion (STMSF) was proposed. In the encoder, Agent Attention was used to maintain global context modeling capability while improving computational efficiency. In the decoder, Multi-Scale Cross Attention (MSCA) was introduced, which combines cross attention with depthwise separable convolution to obtain multi-scale features while fusing multi-modal features more thoroughly. On the MSCOCO dataset, compared with the SCD-Net (Semantic-Conditional Diffusion Networks) method, the BLEU4 (BiLingual Evaluation Understudy with 4-grams) and CIDEr (Consensus-based Image Description Evaluation) metrics of STMSF were improved by 1.1 and 5.3 percentage points, respectively. Comparison and ablation experiment results show that the proposed single-stage method STMSF can effectively improve model performance and generate high-quality image captions.
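The encoder's efficiency claim rests on Agent Attention, which routes attention through a small set of agent tokens instead of computing full pairwise attention. The sketch below is a minimal single-head illustration of that general idea, not the authors' implementation; the agent tokens, shapes, and scaling are assumptions.

```python
import math

def softmax_rows(mat):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in mat:
        m = max(row)
        es = [math.exp(x - m) for x in row]
        s = sum(es)
        out.append([e / s for e in es])
    return out

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def agent_attention(Q, K, V, agents):
    """Two softmax-attention steps through a small agent set:
    agents first pool K/V globally, then each query reads from the
    pooled values. Cost is O(n * num_agents * d) instead of O(n^2 * d)."""
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    # agent aggregation: each agent token attends over all keys
    agg = softmax_rows([[x * scale for x in row]
                        for row in matmul(agents, transpose(K))])
    Va = matmul(agg, V)  # (num_agents, d) pooled values
    # agent broadcast: each query attends over the agents only
    bro = softmax_rows([[x * scale for x in row]
                        for row in matmul(Q, transpose(agents))])
    return matmul(bro, Va)  # (len(Q), d)
```

With a single agent token every query receives the same pooled value, which makes the linear (rather than quadratic) dependence on sequence length easy to see.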

Key words: Swin Transformer, multi-scale feature, feature fusion, image caption, depthwise separable convolution
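The decoder's MSCA is described only as a combination of cross attention and depthwise separable convolution over multiple scales. The following is a hypothetical sketch of that combination; the kernel sizes, averaging depthwise kernels, identity pointwise mix, and fusion-by-summation are all illustrative assumptions, not the paper's exact design.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Single-head cross attention: each query attends over all key/value rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def depthwise_separable_conv1d(seq, dw_kernels, pw_weights):
    """Depthwise conv (one kernel per channel, zero padding), then 1x1 pointwise mix."""
    n, c = len(seq), len(seq[0])
    k = len(dw_kernels[0])
    pad = k // 2
    dw = [[0.0] * c for _ in range(n)]
    for ch in range(c):
        for t in range(n):
            acc = 0.0
            for j in range(k):
                idx = t + j - pad
                if 0 <= idx < n:
                    acc += dw_kernels[ch][j] * seq[idx][ch]
            dw[t][ch] = acc
    # pointwise 1x1 convolution mixes channels
    return [[sum(dw[t][i] * pw_weights[i][o] for i in range(c))
             for o in range(c)] for t in range(n)]

def msca_sketch(queries, enc_feats, kernel_sizes=(3, 5)):
    """Cross-attend to encoder features, then average multi-scale
    depthwise-separable branches (hypothetical fusion rule)."""
    attended = cross_attention(queries, enc_feats, enc_feats)
    c = len(attended[0])
    fused = [[0.0] * c for _ in attended]
    for k in kernel_sizes:
        dw = [[1.0 / k] * k for _ in range(c)]  # averaging depthwise kernels (assumed)
        pw = [[1.0 if i == o else 0.0 for o in range(c)] for i in range(c)]  # identity mix
        branch = depthwise_separable_conv1d(attended, dw, pw)
        for t in range(len(fused)):
            for ch in range(c):
                fused[t][ch] += branch[t][ch] / len(kernel_sizes)
    return fused
```

The two branch kernel sizes stand in for the "multi-scale" part: each branch sees a different local neighborhood of the attended sequence before the results are merged.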

