Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (10): 3154-3160. DOI: 10.11772/j.issn.1001-9081.2024101478
• Artificial Intelligence •
Image caption method based on Swin Transformer and multi-scale feature fusion
Ziyi WANG1, Weijun LI1,2, Xueyang LIU1, Jianping DING1, Shixia LIU1, Yilei SU1
Received: 2024-10-22
Revised: 2024-12-03
Accepted: 2024-12-09
Online: 2024-12-17
Published: 2025-10-10
Contact: Weijun LI
About author: WANG Ziyi, born in 2001, M.S. candidate, CCF member. Her research interests include image captioning and natural language processing.
Abstract:
Transformer-based image captioning methods compute attention weights over the entire input sequence through multi-head attention and therefore lack hierarchical feature extraction capability, and two-stage image captioning pipelines limit model performance. To address these problems, an image captioning method based on Swin Transformer and multi-scale feature fusion (STMSF) was proposed. In the encoder, Agent Attention was employed to improve computational efficiency while preserving the ability to model global context; in the decoder, Multi-Scale Cross Attention (MSCA) was proposed, which combines cross attention with depthwise separable convolution to obtain multi-scale features while fusing multimodal features more thoroughly. Experimental results on the MSCOCO dataset show that, compared with SCD-Net (Semantic-Conditional Diffusion Network), STMSF improves the BLEU4 (BiLingual Evaluation Understudy with 4-grams) and CIDEr (Consensus-based Image Description Evaluation) metrics by 1.1 and 5.3 percentage points, respectively. Results of comparison and ablation experiments demonstrate that the proposed one-stage STMSF effectively improves model performance and generates high-quality captions.
Ziyi WANG, Weijun LI, Xueyang LIU, Jianping DING, Shixia LIU, Yilei SU. Image caption method based on Swin Transformer and multi-scale feature fusion[J]. Journal of Computer Applications, 2025, 45(10): 3154-3160.
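The MSCA module described in the abstract combines cross attention between caption tokens and visual features with depthwise separable convolutions at several kernel widths. The following is a minimal PyTorch sketch of that idea, not the paper's exact implementation; the module name, kernel sizes, and fusion by concatenation plus projection are assumptions.

```python
# Hypothetical sketch of multi-scale cross attention (MSCA): cross attention
# followed by parallel depthwise separable conv branches at different scales.
import torch
import torch.nn as nn

class MSCASketch(nn.Module):
    def __init__(self, d_model=512, num_heads=8, kernel_sizes=(3, 5)):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # One depthwise separable branch per scale: depthwise conv
        # (groups == channels) followed by a pointwise 1x1 conv.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(d_model, d_model, k, padding=k // 2, groups=d_model),
                nn.Conv1d(d_model, d_model, 1),
            )
            for k in kernel_sizes
        )
        self.proj = nn.Linear(d_model * (1 + len(kernel_sizes)), d_model)

    def forward(self, text, image):
        # text: (B, T, C) caption features; image: (B, N, C) visual features.
        attn_out, _ = self.cross_attn(text, image, image)
        # Conv1d expects (B, C, T); each branch extracts one scale of features.
        feats = [attn_out] + [b(attn_out.transpose(1, 2)).transpose(1, 2)
                              for b in self.branches]
        return self.proj(torch.cat(feats, dim=-1))

out = MSCASketch()(torch.randn(2, 20, 512), torch.randn(2, 49, 512))  # (2, 20, 512)
```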
Tab. 1 Performance comparison on MSCOCO test set (unit: %)

| Category | Method | B1 | B4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CNN-RNN | SCST | — | 34.2 | 26.7 | 55.7 | 114.0 | — |
| | Up-Down | 79.8 | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 |
| | GCN-LSTM | 80.5 | 38.2 | 28.5 | 58.3 | 127.6 | 22.0 |
| | AoANet | 80.2 | 38.9 | 29.2 | 58.8 | 129.8 | 22.4 |
| | X-LAN | 80.8 | 39.5 | 29.5 | 59.2 | 132.0 | 23.4 |
| | VRCDA | 80.6 | 37.9 | 28.4 | 58.2 | 123.7 | 21.8 |
| Transformer | X-Transformer | 80.9 | 39.7 | 29.5 | 59.1 | 132.8 | 23.4 |
| | M2 Transformer | 80.8 | 39.1 | 29.2 | 58.6 | 131.2 | 22.6 |
| | RSTNet | 81.8 | 40.1 | 29.8 | 59.5 | 135.6 | 23.3 |
| | DLCT | 81.4 | 39.8 | 29.5 | 59.1 | 133.8 | 23.0 |
| | GAT | 80.8 | 39.7 | 29.1 | 59.0 | 130.5 | 22.9 |
| | A2 Transformer | 81.5 | 39.8 | 29.6 | 59.1 | 133.9 | 23.0 |
| | S2 Transformer | 81.1 | 39.6 | 29.6 | 59.1 | 133.5 | 23.2 |
| | LSTNet | 81.5 | 40.3 | 29.6 | 59.4 | 134.8 | 23.1 |
| | SCD-Net | 81.3 | 39.4 | 29.2 | 59.1 | 131.6 | 23.0 |
| | STMSF | 82.2 | 40.5 | 30.0 | 59.9 | 136.9 | 23.9 |
Tab. 2 Performance comparison on AI Challenger test set (unit: %)

| Method | B1 | B4 | M | R | C |
| --- | --- | --- | --- | --- | --- |
| Up-Down | 78.1 | 48.0 | 40.8 | 70.6 | 198.5 |
| BiGRU-RA | — | — | 41.3 | 70.9 | 192.0 |
| NICVATP2L | 75.9 | 44.3 | 36.5 | 61.9 | 130.8 |
| DenseNet-BiLSTM | 78.5 | 47.8 | 41.5 | 71.2 | 191.3 |
| I-GRUs | 68.8 | 26.8 | 23.9 | — | 85.6 |
| STMSF | 84.8 | 59.1 | 42.2 | 70.0 | 200.6 |
| STMSF* | 86.4 | 61.3 | 42.8 | 71.4 | 215.0 |
Tab. 3 Performance comparison before and after incorporating reinforcement learning (unit: %)

| RL | B1 | B4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- |
| N | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
| Y | 82.2 | 40.5 | 30.0 | 59.9 | 136.9 | 23.9 |
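The RL rows in Tab. 3 correspond to CIDEr-based optimization in the style of self-critical sequence training (SCST) [18], where the greedily decoded caption serves as the reward baseline. A minimal sketch of that loss, assuming per-caption summed log-probabilities and CIDEr scores have already been computed (all names here are hypothetical):

```python
import torch

def scst_loss(log_probs: torch.Tensor,
              sampled_cider: torch.Tensor,
              greedy_cider: torch.Tensor) -> torch.Tensor:
    """log_probs: (B,) summed log-probabilities of sampled captions;
    sampled_cider / greedy_cider: (B,) CIDEr scores of the sampled
    captions and of the greedy-decoded baseline captions."""
    # Baseline-subtracted reward: only captions better than greedy
    # decoding receive a positive learning signal.
    advantage = sampled_cider - greedy_cider
    return -(advantage.detach() * log_probs).mean()
```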
Tab. 4 Comparison of ablation results for each module (unit: %)

| Agent Attention | MSCA | B4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- |
| × | × | 36.8 | 28.9 | 57.7 | 122.8 | 22.2 |
| √ | × | 37.2 | 29.1 | 57.9 | 123.0 | 22.1 |
| × | √ | 37.3 | 29.0 | 58.0 | 122.8 | 22.1 |
| √ | √ | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
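The Agent Attention rows in Tab. 4 refer to the encoder attention of [4], which routes softmax attention through a small set of pooled "agent" tokens: the agents first aggregate the values, then the queries read from the agents, reducing cost from O(N²) toward O(N·n_agents) while keeping global context. A rough sketch under the assumption that agents are average-pooled from the queries; the agent bias and depthwise-conv terms of the original paper are omitted:

```python
import torch
import torch.nn.functional as F

def agent_attention(q, k, v, n_agents=49):
    # q, k, v: (B, N, C). Agent tokens are adaptively pooled from the queries.
    b, n, c = q.shape
    agents = F.adaptive_avg_pool1d(q.transpose(1, 2), n_agents).transpose(1, 2)
    scale = c ** -0.5
    # Step 1: agents attend to all keys and aggregate the values. (B, A, C)
    agent_v = F.softmax(agents @ k.transpose(1, 2) * scale, dim=-1) @ v
    # Step 2: every query attends only to the A agents. (B, N, C)
    return F.softmax(q @ agents.transpose(1, 2) * scale, dim=-1) @ agent_v
```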
Tab. 5 Performance comparison with different numbers of encoder-decoder layers

| Number of layers | B1/% | B4/% | M/% | R/% | C/% | S/% |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 76.3 | 37.7 | 28.0 | 57.8 | 120.1 | 21.2 |
| 2 | 78.3 | 37.6 | 28.9 | 57.8 | 123.1 | 21.9 |
| 3 | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
| 4 | 77.3 | 36.9 | 28.9 | 57.5 | 122.9 | 22.1 |
Tab. 6 Performance comparison of different backbone networks

| Backbone | Image size | B1/% | B4/% | M/% | R/% | C/% | S/% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-B | 384×384 | 76.7 | 36.4 | 28.8 | 57.2 | 121.4 | 21.8 |
| Swin-L | 384×384 | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
Tab. 7 Performance verification of DWC module in encoder (unit: %)

| DWC module | B1 | B4 | M | R | C | S |
| --- | --- | --- | --- | --- | --- | --- |
| × | 77.1 | 36.1 | 28.5 | 57.0 | 120.2 | 21.6 |
| √ | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
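The DWC module ablated in Tab. 7 is a depthwise convolution branch in the encoder; in Agent Attention [4], such a term is added on the values as a residual to restore the feature diversity lost along the low-rank agent path. A sketch assuming the N visual tokens form an h×w grid (the class name and placement are hypothetical):

```python
import torch.nn as nn

class DWCResidual(nn.Module):
    """Depthwise conv over the 2D token grid, added back as a residual."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # groups == channels makes the convolution depthwise.
        self.dwc = nn.Conv2d(channels, channels, kernel_size,
                             padding=kernel_size // 2, groups=channels)

    def forward(self, v, h, w):
        b, n, c = v.shape
        assert n == h * w  # tokens must tile an h x w feature map
        v2d = v.transpose(1, 2).reshape(b, c, h, w)
        return v + self.dwc(v2d).flatten(2).transpose(1, 2)
```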
[1] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
[2] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[EB/OL]. [2024-11-28].
[3] LIU Z, LIN Y, CAO Y, et al. Swin Transformer: hierarchical vision Transformer using shifted windows[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 9992-10002.
[4] HAN D, YE T, HAN Y, et al. Agent attention: on the integration of Softmax and linear attention[C]// Proceedings of the 2024 European Conference on Computer Vision, LNCS 15108. Cham: Springer, 2025: 124-140.
[5] CHEN L, ZHANG H, XIAO J, et al. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6298-6306.
[6] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086.
[7] PAN Y, YAO T, LI Y, et al. X-linear attention networks for image captioning[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10971-10980.
[8] CORNIA M, STEFANINI M, BARALDI L, et al. Meshed-memory Transformer for image captioning[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10575-10584.
[9] LUO Y, JI J, SUN X, et al. Dual-level collaborative Transformer for image captioning[C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 2286-2293.
[10] CHEN X, FANG H, LIN T Y, et al. Microsoft COCO captions: data collection and evaluation server[EB/OL]. [2024-11-28].
[11] WU J, ZHENG H, ZHAO B, et al. Large-scale datasets for going deeper in image understanding[C]// Proceedings of the 2019 IEEE International Conference on Multimedia and Expo. Piscataway: IEEE, 2019: 1480-1485.
[12] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3128-3137.
[13] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]// Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2002: 311-318.
[14] DENKOWSKI M, LAVIE A. Meteor universal: language specific translation evaluation for any target language[C]// Proceedings of the 9th Workshop on Statistical Machine Translation. Stroudsburg: ACL, 2014: 376-380.
[15] LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]// Proceedings of the ACL-04 Workshop: Text Summarization Branches Out. Stroudsburg: ACL, 2004: 74-81.
[16] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 4566-4575.
[17] ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: semantic propositional image caption evaluation[C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9909. Cham: Springer, 2016: 382-398.
[18] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1179-1195.
[19] YAO T, PAN Y, LI Y, et al. Exploring visual relationship for image captioning[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11218. Cham: Springer, 2018: 711-727.
[20] HUANG L, WANG W, CHEN J, et al. Attention on attention for image captioning[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 4633-4642.
[21] LIU M F, SHI Q, NIE L Q. Image captioning based on visual relevance and context dual attention[J]. Journal of Software, 2022, 33(9): 3210-3222.
[22] ZHANG X, SUN X, LUO Y, et al. RSTNet: captioning with adaptive attention on visual and non-visual words[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 15460-15469.
[23] WANG C, SHEN Y, JI L. Geometry attention Transformer with position-aware LSTMs for image captioning[J]. Expert Systems with Applications, 2022, 201: No.117174.
[24] FEI Z. Attention-aligned Transformer for image captioning[C]// Proceedings of the 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 607-615.
[25] ZENG P, ZHANG H, SONG J, et al. S2 Transformer for image captioning[C]// Proceedings of the 31st International Joint Conference on Artificial Intelligence. California: ijcai.org, 2022: 1608-1614.
[26] MA Y, JI J, SUN X, et al. Towards local visual modeling for image captioning[J]. Pattern Recognition, 2023, 138: No.109420.
[27] LUO J, LI Y, PAN Y, et al. Semantic-conditional diffusion networks for image captioning[C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 23359-23368.
[28] DENG Z R, ZHANG Y L, YANG R, et al. BiGRU-RA model for image Chinese captioning via global and local features[J]. Journal of Computer-Aided Design and Computer Graphics, 2021, 33(1): 49-58.
[29] LIU M, HU H, LI L, et al. Chinese image caption generation via visual attention and topic modeling[J]. IEEE Transactions on Cybernetics, 2022, 52(2): 1247-1257.
[30] LU H, YANG R, DENG Z, et al. Chinese image captioning via fuzzy attention-based DenseNet-BiLSTM[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2021, 17(1s): No.14.
[31] PAN Y, WANG L, DUAN S, et al. Chinese image caption of Inceptionv4 and double-layer GRUs based on attention mechanism[J]. Journal of Physics: Conference Series, 2021, 1861: No.012044.
[32] HODOSH M, YOUNG P, HOCKENMAIER J. Framing image description as a ranking task: data, models and evaluation metrics[J]. Journal of Artificial Intelligence Research, 2013, 47: 853-899.
[33] KATIYAR S, BORGOHAIN S K. Analysis of convolutional decoder for image caption generation[EB/OL]. [2024-11-28].
[34] LI X, YUAN A, LU X. Multi-modal gated recurrent units for image description[J]. Multimedia Tools and Applications, 2018, 77(22): 29847-29869.