Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (10): 3154-3160. DOI: 10.11772/j.issn.1001-9081.2024101478
• Artificial intelligence •
Ziyi WANG1, Weijun LI1,2, Xueyang LIU1, Jianping DING1, Shixia LIU1, Yilei SU1
Received: 2024-10-22
Revised: 2024-12-03
Accepted: 2024-12-09
Online: 2024-12-17
Published: 2025-10-10
Contact: Weijun LI
About author: WANG Ziyi, born in 2001 in Tai'an, Shandong, M. S. candidate, CCF member. Her research interests include image captioning and natural language processing.
Ziyi WANG, Weijun LI, Xueyang LIU, Jianping DING, Shixia LIU, Yilei SU. Image caption method based on Swin Transformer and multi-scale feature fusion[J]. Journal of Computer Applications, 2025, 45(10): 3154-3160.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024101478
| Category | Method | B1 | B4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CNN-RNN | SCST | — | 34.2 | 26.7 | 55.7 | 114.0 | — |
| | Up-Down | 79.8 | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 |
| | GCN-LSTM | 80.5 | 38.2 | 28.5 | 58.3 | 127.6 | 22.0 |
| | AoANet | 80.2 | 38.9 | 29.2 | 58.8 | 129.8 | 22.4 |
| | X-LAN | 80.8 | 39.5 | 29.5 | 59.2 | 132.0 | 23.4 |
| | VRCDA | 80.6 | 37.9 | 28.4 | 58.2 | 123.7 | 21.8 |
| Transformer | X-Transformer | 80.9 | 39.7 | 29.5 | 59.1 | 132.8 | 23.4 |
| | M2 Transformer | 80.8 | 39.1 | 29.2 | 58.6 | 131.2 | 22.6 |
| | RSTNet | 81.8 | 40.1 | 29.8 | 59.5 | 135.6 | 23.3 |
| | DLCT | 81.4 | 39.8 | 29.5 | 59.1 | 133.8 | 23.0 |
| | GAT | 80.8 | 39.7 | 29.1 | 59.0 | 130.5 | 22.9 |
| | A2 Transformer | 81.5 | 39.8 | 29.6 | 59.1 | 133.9 | 23.0 |
| | S2 Transformer | 81.1 | 39.6 | 29.6 | 59.1 | 133.5 | 23.2 |
| | LSTNet | 81.5 | 40.3 | 29.6 | 59.4 | 134.8 | 23.1 |
| | SCD-Net | 81.3 | 39.4 | 29.2 | 59.1 | 131.6 | 23.0 |
| | STMSF | 82.2 | 40.5 | 30.0 | 59.9 | 136.9 | 23.9 |
Tab. 1 Performance comparison on MSCOCO test set
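In Tables 1 and 2, B1/B4, M, R, C and S denote BLEU-1/BLEU-4 [13], METEOR [14], ROUGE-L [15], CIDEr [16] and SPICE [17], all reported ×100. For reference, below is a minimal sketch of computing these scores with the standard MSCOCO caption evaluation toolkit (pycocoevalcap, cf. [10]); the image id and captions are made-up examples, METEOR requires a Java runtime, and SPICE is omitted for the same reason.

```python
# Minimal sketch: scoring generated captions with pycocoevalcap.
# Assumes the package is installed and captions are lowercased, whitespace-tokenized strings.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Ground-truth references: image id -> list of reference captions (made-up example).
gts = {1: ["a man riding a horse on the beach",
           "a person rides a horse near the ocean"]}
# Model output: image id -> one generated caption, wrapped in a list.
res = {1: ["a man is riding a horse on the beach"]}

scorers = [
    (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
    (Meteor(), "METEOR"),
    (Rouge(), "ROUGE-L"),
    (Cider(), "CIDEr"),
]
for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)
    if isinstance(name, list):          # Bleu returns one score per n-gram order
        for n, s in zip(name, score):
            print(f"{n}: {100 * s:.1f}")
    else:
        print(f"{name}: {100 * score:.1f}")
```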
| Method | B1 | B4 | M | R | C |
| --- | --- | --- | --- | --- | --- |
| Up-Down | 78.1 | 48.0 | 40.8 | 70.6 | 198.5 |
| BiGRU-RA | — | — | 41.3 | 70.9 | 192.0 |
| NICVATP2L | 75.9 | 44.3 | 36.5 | 61.9 | 130.8 |
| DenseNet-BiLSTM | 78.5 | 47.8 | 41.5 | 71.2 | 191.3 |
| I-GRUs | 68.8 | 26.8 | 23.9 | — | 85.6 |
| STMSF | 84.8 | 59.1 | 42.2 | 70.0 | 200.6 |
| STMSF* | 86.4 | 61.3 | 42.8 | 71.4 | 215.0 |
Tab. 2 Performance comparison on AI Challenger test set
| RL | B1 | B4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- |
| N | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
| Y | 82.2 | 40.5 | 30.0 | 59.9 | 136.9 | 23.9 |
Tab. 3 Performance comparison before and after incorporating reinforcement learning
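Table 3 compares the model before and after reinforcement-learning fine-tuning (CIDEr rises from 123.0 to 136.9). The standard recipe for this stage in captioning is self-critical sequence training (SCST) [18], which uses the reward of the greedy decode as the baseline for sampled captions. The sketch below illustrates that loss only; `model.sample` and `cider_reward` are hypothetical placeholder interfaces, not the authors' code.

```python
import torch

def scst_loss(model, images, gt_refs, cider_reward):
    """Self-critical sequence training loss (Rennie et al. [18]), sketched.

    Assumed interfaces: model.sample(images, greedy=...) returns token sequences
    plus per-sequence log-probabilities of the sampled tokens, and
    cider_reward(captions, refs) returns one CIDEr score per caption.
    """
    # Baseline: reward of the greedy (test-time) decode, no gradients needed.
    with torch.no_grad():
        greedy_caps, _ = model.sample(images, greedy=True)
        baseline = cider_reward(greedy_caps, gt_refs)             # shape (B,)

    # Exploration: multinomial sampling, keeping log-probs for the policy gradient.
    sampled_caps, log_probs = model.sample(images, greedy=False)  # log_probs: (B,)
    reward = cider_reward(sampled_caps, gt_refs)                  # shape (B,)

    # Policy-gradient loss with the greedy reward as baseline (advantage).
    advantage = (reward - baseline).detach()
    return -(advantage * log_probs).mean()
```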
| Agent Attention | MSCA | B4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- |
| × | × | 36.8 | 28.9 | 57.7 | 122.8 | 22.2 |
| √ | × | 37.2 | 29.1 | 57.9 | 123.0 | 22.1 |
| × | √ | 37.3 | 29.0 | 58.0 | 122.8 | 22.1 |
| √ | √ | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
Tab. 4 Comparison of ablation results for each module
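Agent Attention in Table 4 refers to the mechanism of [4], in which a small set of agent tokens pooled from the queries mediates between queries and keys/values in two softmax steps; MSCA denotes the paper's multi-scale module, whose internals are not given in this excerpt, so it is not sketched. Below is a minimal single-head PyTorch sketch of agent attention only, omitting the multi-head split and the auxiliary depth-wise convolution and bias terms of the original design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentAttention(nn.Module):
    """Single-head sketch of agent attention [4]: M agent tokens (pooled from the
    queries) replace the full N x N softmax attention with an N x M and an M x N step.
    Simplified relative to the paper (no multi-head split, no DWC/bias terms)."""
    def __init__(self, dim: int, num_agents: int = 49):
        super().__init__()
        self.scale = dim ** -0.5
        self.num_agents = num_agents
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Agent tokens: adaptive pooling of the queries down to M tokens.
        agents = F.adaptive_avg_pool1d(q.transpose(1, 2), self.num_agents)
        agents = agents.transpose(1, 2)                    # (B, M, C)

        # Step 1 (agent aggregation): agents attend to keys/values.
        agent_v = F.softmax(agents @ k.transpose(1, 2) * self.scale, dim=-1) @ v      # (B, M, C)
        # Step 2 (agent broadcast): queries attend to the agent tokens.
        out = F.softmax(q @ agents.transpose(1, 2) * self.scale, dim=-1) @ agent_v    # (B, N, C)
        return self.proj(out)
```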
| Number of layers | B1/% | B4/% | M/% | R/% | C/% | S/% |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 76.3 | 37.7 | 28.0 | 57.8 | 120.1 | 21.2 |
| 2 | 78.3 | 37.6 | 28.9 | 57.8 | 123.1 | 21.9 |
| 3 | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
| 4 | 77.3 | 36.9 | 28.9 | 57.5 | 122.9 | 22.1 |
Tab. 5 Performance comparison with different numbers of encoder-decoder layers
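Table 5 indicates that three encoder and decoder layers give the best overall trade-off here. Purely to illustrate how this depth becomes a single configuration knob, the sketch below uses PyTorch's stock nn.Transformer; it is not the STMSF architecture, and all sizes (d_model, heads, vocabulary) are placeholder values.

```python
import torch.nn as nn

def build_caption_transformer(d_model: int = 512, n_heads: int = 8,
                              num_layers: int = 3, vocab_size: int = 10000) -> nn.Module:
    """Generic encoder-decoder stack with a configurable depth (Table 5 uses 3)."""
    return nn.ModuleDict({
        "embed": nn.Embedding(vocab_size, d_model),          # word embedding for the decoder input
        "transformer": nn.Transformer(
            d_model=d_model,
            nhead=n_heads,
            num_encoder_layers=num_layers,                    # the depth varied in Table 5
            num_decoder_layers=num_layers,
            batch_first=True,
        ),
        "lm_head": nn.Linear(d_model, vocab_size),            # projects decoder states to vocabulary logits
    })
```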
| Backbone | Image size | B1/% | B4/% | M/% | R/% | C/% | S/% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-B | 384×384 | 76.7 | 36.4 | 28.8 | 57.2 | 121.4 | 21.8 |
| Swin-L | 384×384 | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
Tab. 6 Performance comparison of different backbone networks
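Table 6 compares Swin-B and Swin-L backbones [3] at a 384×384 input resolution. Below is a hedged sketch of extracting the grid features such a backbone produces, using the timm library; the model name and the exact output layout depend on the installed timm version and are assumptions, not the authors' pipeline.

```python
import timm
import torch

# Sketch: grid-feature extraction with a Swin-L backbone at 384x384 (cf. Table 6).
# Assumes your timm version provides this model name; pretrained=False keeps the
# example self-contained (no weight download).
backbone = timm.create_model("swin_large_patch4_window12_384", pretrained=False)
backbone.eval()

images = torch.randn(2, 3, 384, 384)            # dummy batch of 384x384 images
with torch.no_grad():
    feats = backbone.forward_features(images)   # last-stage feature map
# Depending on the timm version the output is (B, H, W, C) or (B, H*W, C);
# flatten the spatial dimensions to get the token sequence fed to the decoder.
tokens = feats.reshape(feats.shape[0], -1, feats.shape[-1])
print(tokens.shape)                              # e.g. (2, 144, 1536) for Swin-L
```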
| DWC module | B1 | B4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- |
| × | 77.1 | 36.1 | 28.5 | 57.0 | 120.2 | 21.6 |
| √ | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
Tab. 7 Performance verification of the DWC module in the encoder
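Assuming "DWC" abbreviates a depth-wise convolution module (as the name suggests; the paper's exact design is not reproduced in this excerpt), the sketch below shows a generic depth-wise plus point-wise convolution block of the kind often used to add local spatial bias to Transformer encoder features.

```python
import torch
import torch.nn as nn

class DepthwiseConvBlock(nn.Module):
    """Illustrative depth-wise (+ point-wise) convolution block; an assumed stand-in
    for the paper's DWC module, not its exact implementation. groups=channels makes
    the 3x3 convolution act on each channel independently, adding cheap local mixing."""
    def __init__(self, channels: int):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)   # point-wise channel mixing
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (B, C, H, W)
        return x + self.act(self.norm(self.pw(self.dw(x))))      # residual connection


# Usage on a Swin-L style feature map (1536 channels on a 12x12 grid).
block = DepthwiseConvBlock(1536)
out = block(torch.randn(2, 1536, 12, 12))
print(out.shape)  # torch.Size([2, 1536, 12, 12])
```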
[1] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
[2] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[EB/OL]. [2024-11-28].
[3] LIU Z, LIN Y, CAO Y, et al. Swin Transformer: hierarchical vision Transformer using shifted windows[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 9992-10002.
[4] HAN D, YE T, HAN Y, et al. Agent attention: on the integration of Softmax and linear attention[C]// Proceedings of the 2024 European Conference on Computer Vision, LNCS 15108. Cham: Springer, 2025: 124-140.
[5] CHEN L, ZHANG H, XIAO J, et al. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6298-6306.
[6] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086.
[7] PAN Y, YAO T, LI Y, et al. X-linear attention networks for image captioning[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10971-10980.
[8] CORNIA M, STEFANINI M, BARALDI L, et al. Meshed-memory Transformer for image captioning[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10575-10584.
[9] LUO Y, JI J, SUN X, et al. Dual-level collaborative Transformer for image captioning[C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 2286-2293.
[10] CHEN X, FANG H, LIN T Y, et al. Microsoft COCO captions: data collection and evaluation server[EB/OL]. [2024-11-28].
[11] WU J, ZHENG H, ZHAO B, et al. Large-scale datasets for going deeper in image understanding[C]// Proceedings of the 2019 IEEE International Conference on Multimedia and Expo. Piscataway: IEEE, 2019: 1480-1485.
[12] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3128-3137.
[13] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]// Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2002: 311-318.
[14] DENKOWSKI M, LAVIE A. Meteor universal: language specific translation evaluation for any target language[C]// Proceedings of the 9th Workshop on Statistical Machine Translation. Stroudsburg: ACL, 2014: 376-380.
[15] LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]// Proceedings of the ACL-04 Workshop: Text Summarization Branches Out. Stroudsburg: ACL, 2004: 74-81.
[16] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 4566-4575.
[17] ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: semantic propositional image caption evaluation[C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9909. Cham: Springer, 2016: 382-398.
[18] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1179-1195.
[19] YAO T, PAN Y, LI Y, et al. Exploring visual relationship for image captioning[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11218. Cham: Springer, 2018: 711-727.
[20] HUANG L, WANG W, CHEN J, et al. Attention on attention for image captioning[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 4633-4642.
[21] LIU M F, SHI Q, NIE L Q. Image captioning based on visual relevance and context dual attention[J]. Journal of Software, 2022, 33(9): 3210-3222.
[22] ZHANG X, SUN X, LUO Y, et al. RSTNet: captioning with adaptive attention on visual and non-visual words[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 15460-15469.
[23] WANG C, SHEN Y, JI L. Geometry attention Transformer with position-aware LSTMs for image captioning[J]. Expert Systems with Applications, 2022, 201: No.117174.
[24] FEI Z. Attention-aligned Transformer for image captioning[C]// Proceedings of the 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 607-615.
[25] ZENG P, ZHANG H, SONG J, et al. S2 Transformer for image captioning[C]// Proceedings of the 31st International Joint Conference on Artificial Intelligence. California: ijcai.org, 2022: 1608-1614.
[26] MA Y, JI J, SUN X, et al. Towards local visual modeling for image captioning[J]. Pattern Recognition, 2023, 138: No.109420.
[27] LUO J, LI Y, PAN Y, et al. Semantic-conditional diffusion networks for image captioning[C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 23359-23368.
[28] DENG Z R, ZHANG Y L, YANG R, et al. BiGRU-RA model for image Chinese captioning via global and local features[J]. Journal of Computer-Aided Design and Computer Graphics, 2021, 33(1): 49-58.
[29] LIU M, HU H, LI L, et al. Chinese image caption generation via visual attention and topic modeling[J]. IEEE Transactions on Cybernetics, 2022, 52(2): 1247-1257.
[30] LU H, YANG R, DENG Z, et al. Chinese image captioning via fuzzy attention-based DenseNet-BiLSTM[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2021, 17(1s): No.14.
[31] PAN Y, WANG L, DUAN S, et al. Chinese image caption of Inceptionv4 and double-layer GRUs based on attention mechanism[J]. Journal of Physics: Conference Series, 2021, 1861: No.012044.
[32] HODOSH M, YOUNG P, HOCKENMAIER J. Framing image description as a ranking task: data, models and evaluation metrics[J]. Journal of Artificial Intelligence Research, 2013, 47: 853-899.
[33] KATIYAR S, BORGOHAIN S K. Analysis of convolutional decoder for image caption generation[EB/OL]. [2024-11-28].
[34] LI X, YUAN A, LU X. Multi-modal gated recurrent units for image description[J]. Multimedia Tools and Applications, 2018, 77(22): 29847-29869.