Transformer-based image captioning methods compute multi-head attention weights over the entire input sequence and therefore lack hierarchical feature extraction capability; in addition, two-stage image captioning pipelines limit model performance. To address these issues, an image captioning method based on Swin Transformer and Multi-Scale feature Fusion (STMSF) was proposed. In the encoder, Agent Attention was used to preserve global context modeling capability while improving computational efficiency. In the decoder, Multi-Scale Cross Attention (MSCA) was proposed, which combines cross-attention with depthwise separable convolution to obtain multi-scale features and better fuse multi-modal features. Experimental results on the MSCOCO dataset show that, compared with the SCD-Net (Semantic-Conditional Diffusion Network) method, STMSF improves the BLEU4 (BiLingual Evaluation Understudy with 4-grams) and CIDEr (Consensus-based Image Description Evaluation) metrics by 1.1 and 5.3 percentage points, respectively. These comparison and ablation results demonstrate that the proposed single-stage STMSF effectively improves model performance and generates high-quality image captions.
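To make the decoder mechanism concrete, the following is a minimal PyTorch sketch of a multi-scale cross-attention block of the kind the abstract describes: caption tokens attend to visual features via cross-attention, and the attended output is passed through depthwise separable convolution branches at several kernel sizes before fusion. All names, hyperparameters, and the fusion scheme here are illustrative assumptions, not the paper's actual MSCA implementation.

```python
import torch
import torch.nn as nn


class MultiScaleCrossAttention(nn.Module):
    """Hypothetical sketch of an MSCA-style block: cross-attention between
    caption tokens (queries) and visual features (keys/values), followed by
    depthwise separable convolutions at multiple kernel sizes to capture
    multi-scale context, then a linear fusion with a residual connection."""

    def __init__(self, dim=512, num_heads=8, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One depthwise separable conv branch per scale (depthwise + pointwise).
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim),  # depthwise
                nn.Conv1d(dim, dim, 1),                              # pointwise
            )
            for k in kernel_sizes
        )
        self.fuse = nn.Linear(dim * len(kernel_sizes), dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        # Cross-attention: text queries attend to visual keys/values.
        attended, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        # Multi-scale depthwise separable convolutions over the token axis.
        x = attended.transpose(1, 2)                                 # (B, C, T)
        feats = [branch(x).transpose(1, 2) for branch in self.branches]
        fused = self.fuse(torch.cat(feats, dim=-1))                  # fuse scales
        return self.norm(text_tokens + fused)                        # residual


if __name__ == "__main__":
    msca = MultiScaleCrossAttention()
    text = torch.randn(2, 20, 512)     # caption token features
    vision = torch.randn(2, 49, 512)   # e.g. Swin Transformer grid features
    print(msca(text, vision).shape)    # torch.Size([2, 20, 512])
```

This sketch only illustrates how cross-attention and depthwise separable convolutions can be composed to inject multi-scale context into the multi-modal fusion step; the actual branch count, kernel sizes, and fusion operator used by STMSF would follow the paper's method section.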