Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (10): 3154-3160. DOI: 10.11772/j.issn.1001-9081.2024101478
• Artificial Intelligence •

Image caption method based on Swin Transformer and multi-scale feature fusion

Ziyi WANG1, Weijun LI1,2, Xueyang LIU1, Jianping DING1, Shixia LIU1, Yilei SU1
Received: 2024-10-22
Revised: 2024-12-03
Accepted: 2024-12-09
Online: 2024-12-17
Published: 2025-10-10
Contact: Weijun LI
About author: WANG Ziyi, born in 2001 in Tai'an, Shandong, M. S. candidate, CCF member. Her research interests include image captioning and natural language processing.
Ziyi WANG, Weijun LI, Xueyang LIU, Jianping DING, Shixia LIU, Yilei SU. Image caption method based on Swin Transformer and multi-scale feature fusion[J]. Journal of Computer Applications, 2025, 45(10): 3154-3160.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024101478
| Category | Method | B1 | B4 | M | R | C | S |
|---|---|---|---|---|---|---|---|
| CNN-RNN | SCST | — | 34.2 | 26.7 | 55.7 | 114.0 | — |
| | Up-Down | 79.8 | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 |
| | GCN-LSTM | 80.5 | 38.2 | 28.5 | 58.3 | 127.6 | 22.0 |
| | AoANet | 80.2 | 38.9 | 29.2 | 58.8 | 129.8 | 22.4 |
| | X-LAN | 80.8 | 39.5 | 29.5 | 59.2 | 132.0 | 23.4 |
| | VRCDA | 80.6 | 37.9 | 28.4 | 58.2 | 123.7 | 21.8 |
| Transformer | X-Transformer | 80.9 | 39.7 | 29.5 | 59.1 | 132.8 | 23.4 |
| | M2 Transformer | 80.8 | 39.1 | 29.2 | 58.6 | 131.2 | 22.6 |
| | RSTNet | 81.8 | 40.1 | 29.8 | 59.5 | 135.6 | 23.3 |
| | DLCT | 81.4 | 39.8 | 29.5 | 59.1 | 133.8 | 23.0 |
| | GAT | 80.8 | 39.7 | 29.1 | 59.0 | 130.5 | 22.9 |
| | A2 Transformer | 81.5 | 39.8 | 29.6 | 59.1 | 133.9 | 23.0 |
| | S2 Transformer | 81.1 | 39.6 | 29.6 | 59.1 | 133.5 | 23.2 |
| | LSTNet | 81.5 | 40.3 | 29.6 | 59.4 | 134.8 | 23.1 |
| | SCD-Net | 81.3 | 39.4 | 29.2 | 59.1 | 131.6 | 23.0 |
| | STMSF | 82.2 | 40.5 | 30.0 | 59.9 | 136.9 | 23.9 |
Tab. 1 Performance comparison on MSCOCO test set
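The metric columns abbreviate BLEU-1/BLEU-4 (B1/B4), METEOR (M), ROUGE-L (R), CIDEr (C), and SPICE (S), i.e. the standard caption metrics of refs [13]-[17]. As a minimal sketch of how such scores are typically computed — not the paper's own evaluation code — the pycocoevalcap package can be driven as follows; the image id and caption strings are placeholders:

```python
# Minimal sketch of the B1/B4/M/R/C/S metrics (refs [13]-[17]) via
# pycocoevalcap; METEOR and SPICE shell out to Java under the hood.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

# Ground-truth references and generated captions, keyed by image id
# (placeholder data, normally pre-tokenized with the PTB tokenizer).
gts = {"391895": ["a man riding a bike down a dirt road"]}
res = {"391895": ["a man rides a bicycle on a dirt path"]}

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE-L", Rouge()), ("CIDEr", Cider()),
                     ("SPICE", Spice())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)  # BLEU returns a list [B1, B2, B3, B4]
```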
| Method | B1 | B4 | M | R | C |
|---|---|---|---|---|---|
| Up-Down | 78.1 | 48.0 | 40.8 | 70.6 | 198.5 |
| BiGRU-RA | — | — | 41.3 | 70.9 | 192.0 |
| NICVATP2L | 75.9 | 44.3 | 36.5 | 61.9 | 130.8 |
| DenseNet-BiLSTM | 78.5 | 47.8 | 41.5 | 71.2 | 191.3 |
| I-GRUs | 68.8 | 26.8 | 23.9 | — | 85.6 |
| STMSF | 84.8 | 59.1 | 42.2 | 70.0 | 200.6 |
| STMSF* | 86.4 | 61.3 | 42.8 | 71.4 | 215.0 |
Tab. 2 Performance comparison on AI Challenger test set
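AI Challenger captions are Chinese, so the n-gram metrics in Tab. 2 depend on how captions are segmented before scoring; this section does not specify the scheme. A hedged sketch of the two common choices, word-level segmentation with jieba versus per-character splitting:

```python
# Hypothetical preprocessing for Chinese captions before computing the
# n-gram metrics of Tab. 2; jieba word segmentation here is an assumption,
# not the paper's stated pipeline.
import jieba

caption = "一个男人在土路上骑自行车"
words = " ".join(jieba.cut(caption))  # word-level segmentation
chars = " ".join(caption)             # character-level alternative
print(words)
print(chars)
```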
| RL | B1 | B4 | M | R | C | S | 
|---|---|---|---|---|---|---|
| N | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 | 
| Y | 82.2 | 40.5 | 30.0 | 59.9 | 136.9 | 23.9 | 
Tab. 3 Performance comparison before and after incorporating reinforcement learning
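The RL rows presumably follow self-critical sequence training (SCST, ref [18]): the metric score (typically CIDEr) of a sampled caption is baselined by the score of the greedily decoded caption, and the resulting advantage weights the sample's log-probability. A minimal sketch, where `model.sample`, `model.greedy`, and `cider` are assumed interfaces rather than the paper's API:

```python
import torch

def scst_loss(model, images, gts, cider):
    """One self-critical sequence training step (ref [18]); `model.sample`,
    `model.greedy` and `cider` are hypothetical interfaces."""
    sample_caps, log_probs = model.sample(images)   # stochastic decoding
    with torch.no_grad():
        greedy_caps = model.greedy(images)          # baseline decoding
    # Advantage: sampled caption's CIDEr relative to the greedy baseline.
    reward = cider(sample_caps, gts) - cider(greedy_caps, gts)
    # REINFORCE: push up sampled captions that beat the greedy baseline,
    # push down those that fall short.
    return -(torch.as_tensor(reward) * log_probs.sum(dim=1)).mean()
```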
| Agent Attention | MSCA | B4 | M | R | C | S | 
|---|---|---|---|---|---|---|
| × | × | 36.8 | 28.9 | 57.7 | 122.8 | 22.2 | 
| √ | × | 37.2 | 29.1 | 57.9 | 123.0 | 22.1 | 
| × | √ | 37.3 | 29.0 | 58.0 | 122.8 | 22.1 | 
| √ | √ | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 | 
Tab. 4 Comparison of ablation results for each module
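Tab. 4 ablates Agent Attention (ref [4]) and the MSCA module. In the formulation of ref [4], a small set of agent tokens first aggregates keys/values at linear cost and then broadcasts the result back to the queries. A minimal single-head sketch of that mechanism — illustrative only, not the paper's exact module (and MSCA is not reproduced here):

```python
import torch
import torch.nn.functional as F

def agent_attention(q, k, v, num_agents=49):
    """Single-head sketch of agent attention (ref [4]).
    q, k, v: (batch, n_tokens, dim); agents pooled from the queries is one
    common instantiation, assumed here."""
    b, n, d = q.shape
    agents = F.adaptive_avg_pool1d(q.transpose(1, 2), num_agents).transpose(1, 2)
    scale = d ** -0.5
    # Agents aggregate keys/values: (b, a, n) @ (b, n, d) -> (b, a, d)
    agent_kv = F.softmax(agents @ k.transpose(1, 2) * scale, dim=-1) @ v
    # Queries read from agents: (b, n, a) @ (b, a, d) -> (b, n, d)
    return F.softmax(q @ agents.transpose(1, 2) * scale, dim=-1) @ agent_kv
```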
| Layers | B1/% | B4/% | M/% | R/% | C/% | S/% |
|---|---|---|---|---|---|---|
| 1 | 76.3 | 37.7 | 28.0 | 57.8 | 120.1 | 21.2 |
| 2 | 78.3 | 37.6 | 28.9 | 57.8 | 123.1 | 21.9 |
| 3 | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
| 4 | 77.3 | 36.9 | 28.9 | 57.5 | 122.9 | 22.1 |
Tab. 5 Performance comparison of different encoder-decoder layers
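The depth sweep in Tab. 5 favors a 3-layer encoder-decoder stack. As a generic illustration of that knob only — the paper's layers use Agent Attention and MSCA, which this stock PyTorch stack does not reproduce:

```python
import torch.nn as nn

# Illustrative depth choice mirroring the best row of Tab. 5; dimensions
# are placeholders, not the paper's configuration.
d_model, n_heads, num_layers = 512, 8, 3
layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
```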
| Backbone | Image size | B1/% | B4/% | M/% | R/% | C/% | S/% |
|---|---|---|---|---|---|---|---|
| Swin-B | 384×384 | 76.7 | 36.4 | 28.8 | 57.2 | 121.4 | 21.8 |
| Swin-L | 384×384 | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
Tab. 6 Performance comparison of different backbone networks
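A hedged sketch of pulling hierarchical (multi-scale) feature maps from the Swin-L backbone compared in Tab. 6, using timm; the model name and feature taps are assumptions, not the paper's configuration:

```python
import timm
import torch

# Assumes a timm version whose Swin models support features_only=True;
# the checkpoint name below is timm's, not necessarily the paper's.
backbone = timm.create_model("swin_large_patch4_window12_384",
                             pretrained=True, features_only=True)
feats = backbone(torch.randn(1, 3, 384, 384))
for f in feats:
    print(f.shape)  # four stages with decreasing spatial resolution
```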
| DWC module | B1 | B4 | M | R | C | S |
|---|---|---|---|---|---|---|
| × | 77.1 | 36.1 | 28.5 | 57.0 | 120.2 | 21.6 |
| √ | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
Tab. 7 Performance verification of encoder DWC module
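Reading "DWC" as a depth-wise convolution applied to the encoder tokens — an assumption, since this section does not expand the acronym — a minimal residual block might look like the following; the paper's actual design may differ:

```python
import torch
import torch.nn as nn

class DWConvBlock(nn.Module):
    """Assumed reading of the DWC module in Tab. 7: a depth-wise
    convolution injecting local structure into encoder tokens."""
    def __init__(self, dim: int):
        super().__init__()
        # groups=dim makes the convolution depth-wise (one filter per channel)
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, h, w):
        # x: (batch, h*w, dim) token sequence from the Swin encoder
        b, n, d = x.shape
        y = x.transpose(1, 2).reshape(b, d, h, w)
        y = self.dw(y).flatten(2).transpose(1, 2)
        return x + y  # residual connection
```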
| [1] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010. | 
| [2] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[EB/OL]. [2024-11-28]. | 
| [3] | LIU Z, LIN Y, CAO Y, et al. Swin Transformer: hierarchical vision Transformer using shifted windows[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 9992-10002. | 
| [4] | HAN D, YE T, HAN Y, et al. Agent attention: on the integration of Softmax and linear attention[C]// Proceedings of the 2024 European Conference on Computer Vision, LNCS 15108. Cham: Springer, 2025: 124-140. | 
| [5] | CHEN L, ZHANG H, XIAO J, et al. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6298-6306. | 
| [6] | ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086. | 
| [7] | PAN Y, YAO T, LI Y, et al. X-linear attention networks for image captioning[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10971-10980. | 
| [8] | CORNIA M, STEFANINI M, BARALDI L, et al. Meshed-memory Transformer for image captioning[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10575-10584. | 
| [9] | LUO Y, JI J, SUN X, et al. Dual-level collaborative Transformer for image captioning[C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 2286-2293. | 
| [10] | CHEN X, FANG H, LIN T Y, et al. Microsoft COCO captions: data collection and evaluation server[EB/OL]. [2024-11-28]. | 
| [11] | WU J, ZHENG H, ZHAO B, et al. Large-scale datasets for going deeper in image understanding[C]// Proceedings of the 2019 IEEE International Conference on Multimedia and Expo. Piscataway: IEEE, 2019: 1480-1485. | 
| [12] | KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3128-3137. | 
| [13] | PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]// Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2002: 311-318. | 
| [14] | DENKOWSKI M, LAVIE A. Meteor universal: language specific translation evaluation for any target language[C]// Proceedings of the 9th Workshop on Statistical Machine Translation. Stroudsburg: ACL, 2014: 376-380. | 
| [15] | LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]// Proceedings of the ACL-04 Workshop: Text Summarization Branches Out. Stroudsburg: ACL, 2004: 74-81. | 
| [16] | VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 4566-4575. | 
| [17] | ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: semantic propositional image caption evaluation[C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9909. Cham: Springer, 2016: 382-398. | 
| [18] | RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1179-1195. | 
| [19] | YAO T, PAN Y, LI Y, et al. Exploring visual relationship for image captioning[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11218. Cham: Springer, 2018: 711-727. | 
| [20] | HUANG L, WANG W, CHEN J, et al. Attention on attention for image captioning[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 4633-4642. | 
| [21] | LIU M F, SHI Q, NIE L Q. Image captioning based on visual relevance and context dual attention[J]. Journal of Software, 2022, 33(9): 3210-3222. | 
| [22] | ZHANG X, SUN X, LUO Y, et al. RSTNet: captioning with adaptive attention on visual and non-visual words[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 15460-15469. | 
| [23] | WANG C, SHEN Y, JI L. Geometry attention Transformer with position-aware LSTMs for image captioning[J]. Expert Systems with Applications, 2022, 201: No.117174. | 
| [24] | FEI Z. Attention-aligned Transformer for image captioning[C]// Proceedings of the 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 607-615. | 
| [25] | ZENG P, ZHANG H, SONG J, et al. S2 Transformer for image captioning[C]// Proceedings of the 31st International Joint Conference on Artificial Intelligence. California: ijcai.org, 2022: 1608-1614. | 
| [26] | MA Y, JI J, SUN X, et al. Towards local visual modeling for image captioning[J]. Pattern Recognition, 2023, 138: No.109420. | 
| [27] | LUO J, LI Y, PAN Y, et al. Semantic-conditional diffusion networks for image captioning[C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 23359-23368. | 
| [28] | DENG Z R, ZHANG Y L, YANG R, et al. BiGRU-RA model for image Chinese captioning via global and local features[J]. Journal of Computer-Aided Design and Computer Graphics, 2021, 33(1): 49-58. | 
| [29] | LIU M, HU H, LI L, et al. Chinese image caption generation via visual attention and topic modeling[J]. IEEE Transactions on Cybernetics, 2022, 52(2): 1247-1257. | 
| [30] | LU H, YANG R, DENG Z, et al. Chinese image captioning via fuzzy attention-based DenseNet-BiLSTM[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2021, 17(1s): No.14. | 
| [31] | PAN Y, WANG L, DUAN S, et al. Chinese image caption of Inceptionv4 and double-layer GRUs based on attention mechanism[J]. Journal of Physics: Conference Series, 2021, 1861: No.012044. | 
| [32] | HODOSH M, YOUNG P, HOCKENMAIER J. Framing image description as a ranking task: data, models and evaluation metrics[J]. Journal of Artificial Intelligence Research, 2013, 47: 853-899. | 
| [33] | KATIYAR S, BORGOHAIN S K. Analysis of convolutional decoder for image caption generation[EB/OL]. [2024-11-28]. | 
| [34] | LI X, YUAN A, LU X. Multi-modal gated recurrent units for image description[J]. Multimedia Tools and Applications, 2018, 77(22): 29847-29869. | 