Image caption method based on Swin Transformer and multi-scale feature fusion
Ziyi WANG, Weijun LI, Xueyang LIU, Jianping DING, Shixia LIU, Yilei SU
Journal of Computer Applications 2025, 45(10): 3154-3160. DOI: 10.11772/j.issn.1001-9081.2024101478

Transformer-based image caption methods compute multi-head attention weights over the entire input sequence and thus lack hierarchical feature extraction capability; moreover, two-stage image caption pipelines limit model performance. To address these issues, an image caption method based on Swin Transformer and Multi-Scale feature Fusion (STMSF) was proposed. In the encoder, Agent Attention was adopted to preserve global context modeling capability while improving computational efficiency. In the decoder, Multi-Scale Cross Attention (MSCA) was proposed, combining cross-attention with depthwise separable convolution to obtain multi-scale features and fuse multi-modal features more effectively. Experimental results on the MSCOCO dataset show that, compared with the SCD-Net (Semantic-Conditional Diffusion Network) method, STMSF improves the BLEU4 (BiLingual Evaluation Understudy with 4-grams) and CIDEr (Consensus-based Image Description Evaluation) scores by 1.1 and 5.3 percentage points, respectively. These comparison results, together with ablation results, show that the proposed single-stage STMSF improves model performance effectively and generates high-quality image captions.
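The abstract does not detail the encoder's Agent Attention, so the following is a minimal PyTorch sketch of the general agent-attention idea only: a small set of agent tokens, obtained here by pooling the queries, mediates between queries and keys/values. The single-head form, the query-pooling choice, and all names and default sizes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentAttention(nn.Module):
    """Sketch of agent attention (assumed form): agents first aggregate
    the full sequence, then queries read from the compact agent summary,
    cutting cost from O(N^2) to O(N*n) for n agents."""
    def __init__(self, d_model=512, num_agents=49):
        super().__init__()
        self.num_agents = num_agents
        self.scale = d_model ** -0.5
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (B, N, d_model), e.g. flattened visual features
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Agent tokens: queries pooled down to num_agents positions
        agents = F.adaptive_avg_pool1d(q.transpose(1, 2),
                                       self.num_agents).transpose(1, 2)
        # Step 1 (aggregation): agents attend to all keys/values
        agent_v = torch.softmax(agents @ k.transpose(1, 2) * self.scale,
                                dim=-1) @ v            # (B, n, d_model)
        # Step 2 (broadcast): queries attend to the n agents only
        out = torch.softmax(q @ agents.transpose(1, 2) * self.scale,
                            dim=-1) @ agent_v          # (B, N, d_model)
        return self.proj(out)

# Usage: 49 agents summarize a 196-token (14x14) visual feature grid
attn = AgentAttention()
print(attn(torch.randn(2, 196, 512)).shape)  # torch.Size([2, 196, 512])
```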

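Likewise, MSCA's exact structure is not specified in the abstract; below is a hedged PyTorch sketch of one plausible reading: standard cross-attention between caption tokens (queries) and visual features (keys/values), refined by parallel depthwise separable convolutions at several kernel sizes. The kernel sizes, the additive fusion, the residual connection, and all identifiers are assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """Depthwise separable 1-D convolution: a per-channel depthwise conv
    followed by a pointwise (1x1) conv that mixes channels."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, 1)

    def forward(self, x):            # x: (B, seq_len, channels)
        x = x.transpose(1, 2)        # Conv1d expects (B, channels, seq_len)
        x = self.pointwise(self.depthwise(x))
        return x.transpose(1, 2)

class MultiScaleCrossAttention(nn.Module):
    """Hypothetical MSCA block: cross-attention output is refined by
    depthwise separable convolutions at several kernel sizes and fused
    by summation; the fusion rule is an assumption."""
    def __init__(self, d_model=512, num_heads=8, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads,
                                                batch_first=True)
        self.branches = nn.ModuleList(
            [DepthwiseSeparableConv1d(d_model, k) for k in kernel_sizes])
        self.norm = nn.LayerNorm(d_model)

    def forward(self, tokens, visual_feats):
        # tokens: (B, T, d_model) caption queries; visual_feats: (B, N, d_model)
        attended, _ = self.cross_attn(tokens, visual_feats, visual_feats)
        multi_scale = sum(branch(attended) for branch in self.branches)
        return self.norm(tokens + multi_scale)   # residual connection

# Usage: fuse an 8x8 grid of encoder features with 20 caption tokens
msca = MultiScaleCrossAttention()
out = msca(torch.randn(2, 20, 512), torch.randn(2, 64, 512))
print(out.shape)  # torch.Size([2, 20, 512])
```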