Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (10): 3154-3160. DOI: 10.11772/j.issn.1001-9081.2024101478
• Artificial Intelligence •

Image caption method based on Swin Transformer and multi-scale feature fusion

Ziyi WANG1, Weijun LI1,2, Xueyang LIU1, Jianping DING1, Shixia LIU1, Yilei SU1
Received: 2024-10-22
Revised: 2024-12-03
Accepted: 2024-12-09
Online: 2024-12-17
Published: 2025-10-10
Contact: Weijun LI
About author: WANG Ziyi, born in 2001 in Tai'an, Shandong, M. S. candidate, CCF member. Her research interests include image captioning and natural language processing.
Ziyi WANG, Weijun LI, Xueyang LIU, Jianping DING, Shixia LIU, Yilei SU. Image caption method based on Swin Transformer and multi-scale feature fusion[J]. Journal of Computer Applications, 2025, 45(10): 3154-3160.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024101478
| Category | Method | B1 | B4 | M | R | C | S |
|---|---|---|---|---|---|---|---|
| CNN-RNN | SCST | — | 34.2 | 26.7 | 55.7 | 114.0 | — |
| | Up-Down | 79.8 | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 |
| | GCN-LSTM | 80.5 | 38.2 | 28.5 | 58.3 | 127.6 | 22.0 |
| | AoANet | 80.2 | 38.9 | 29.2 | 58.8 | 129.8 | 22.4 |
| | X-LAN | 80.8 | 39.5 | 29.5 | 59.2 | 132.0 | 23.4 |
| | VRCDA | 80.6 | 37.9 | 28.4 | 58.2 | 123.7 | 21.8 |
| Transformer | X-Transformer | 80.9 | 39.7 | 29.5 | 59.1 | 132.8 | 23.4 |
| | M2 Transformer | 80.8 | 39.1 | 29.2 | 58.6 | 131.2 | 22.6 |
| | RSTNet | 81.8 | 40.1 | 29.8 | 59.5 | 135.6 | 23.3 |
| | DLCT | 81.4 | 39.8 | 29.5 | 59.1 | 133.8 | 23.0 |
| | GAT | 80.8 | 39.7 | 29.1 | 59.0 | 130.5 | 22.9 |
| | A2 Transformer | 81.5 | 39.8 | 29.6 | 59.1 | 133.9 | 23.0 |
| | S2 Transformer | 81.1 | 39.6 | 29.6 | 59.1 | 133.5 | 23.2 |
| | LSTNet | 81.5 | 40.3 | 29.6 | 59.4 | 134.8 | 23.1 |
| | SCD-Net | 81.3 | 39.4 | 29.2 | 59.1 | 131.6 | 23.0 |
| | STMSF | 82.2 | 40.5 | 30.0 | 59.9 | 136.9 | 23.9 |
Tab. 1 Performance comparison on MSCOCO test set
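The metric columns abbreviate BLEU-1/BLEU-4 (B1/B4), METEOR (M), ROUGE-L (R), CIDEr (C), and SPICE (S), i.e. the standard caption metrics of refs [13]-[17]. As a minimal sketch of how such scores are typically computed — not the paper's own evaluation code — the pycocoevalcap package can be driven as follows; the image id and caption strings are placeholders:

```python
# Minimal sketch of the B1/B4/M/R/C/S metrics (refs [13]-[17]) via
# pycocoevalcap; METEOR and SPICE shell out to Java under the hood.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

# Ground-truth references and generated captions, keyed by image id
# (placeholder data, normally pre-tokenized with the PTB tokenizer).
gts = {"391895": ["a man riding a bike down a dirt road"]}
res = {"391895": ["a man rides a bicycle on a dirt path"]}

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE-L", Rouge()), ("CIDEr", Cider()),
                     ("SPICE", Spice())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)  # BLEU returns a list [B1, B2, B3, B4]
```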
| Method | B1 | B4 | M | R | C |
|---|---|---|---|---|---|
| Up-Down | 78.1 | 48.0 | 40.8 | 70.6 | 198.5 |
| BiGRU-RA | — | — | 41.3 | 70.9 | 192.0 |
| NICVATP2L | 75.9 | 44.3 | 36.5 | 61.9 | 130.8 |
| DenseNet-BiLSTM | 78.5 | 47.8 | 41.5 | 71.2 | 191.3 |
| I-GRUs | 68.8 | 26.8 | 23.9 | — | 85.6 |
| STMSF | 84.8 | 59.1 | 42.2 | 70.0 | 200.6 |
| STMSF* | 86.4 | 61.3 | 42.8 | 71.4 | 215.0 |
Tab. 2 Performance comparison on AI Challenger test set
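AI Challenger captions are Chinese, so the n-gram metrics in Tab. 2 depend on how captions are segmented before scoring; this section does not specify the scheme. A hedged sketch of the two common choices, word-level segmentation with jieba versus per-character splitting:

```python
# Hypothetical preprocessing for Chinese captions before computing the
# n-gram metrics of Tab. 2; jieba word segmentation here is an assumption,
# not the paper's stated pipeline.
import jieba

caption = "一个男人在土路上骑自行车"
words = " ".join(jieba.cut(caption))  # word-level segmentation
chars = " ".join(caption)             # character-level alternative
print(words)
print(chars)
```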
| RL | B1 | B4 | M | R | C | S | 
|---|---|---|---|---|---|---|
| N | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 | 
| Y | 82.2 | 40.5 | 30.0 | 59.9 | 136.9 | 23.9 | 
Tab. 3 Performance comparison before and after incorporating reinforcement learning
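The RL rows presumably follow self-critical sequence training (SCST, ref [18]): the metric score (typically CIDEr) of a sampled caption is baselined by the score of the greedily decoded caption, and the resulting advantage weights the sample's log-probability. A minimal sketch, where `model.sample`, `model.greedy`, and `cider` are assumed interfaces rather than the paper's API:

```python
import torch

def scst_loss(model, images, gts, cider):
    """One self-critical sequence training step (ref [18]); `model.sample`,
    `model.greedy` and `cider` are hypothetical interfaces."""
    sample_caps, log_probs = model.sample(images)   # stochastic decoding
    with torch.no_grad():
        greedy_caps = model.greedy(images)          # baseline decoding
    # Advantage: sampled caption's CIDEr relative to the greedy baseline.
    reward = cider(sample_caps, gts) - cider(greedy_caps, gts)
    # REINFORCE: push up sampled captions that beat the greedy baseline,
    # push down those that fall short.
    return -(torch.as_tensor(reward) * log_probs.sum(dim=1)).mean()
```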
| Agent Attention | MSCA | B4 | M | R | C | S | 
|---|---|---|---|---|---|---|
| × | × | 36.8 | 28.9 | 57.7 | 122.8 | 22.2 | 
| √ | × | 37.2 | 29.1 | 57.9 | 123.0 | 22.1 | 
| × | √ | 37.3 | 29.0 | 58.0 | 122.8 | 22.1 | 
| √ | √ | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 | 
Tab. 4 Comparison of ablation results for each module
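Tab. 4 ablates Agent Attention (ref [4]) and the MSCA module. In the formulation of ref [4], a small set of agent tokens first aggregates keys/values at linear cost and then broadcasts the result back to the queries. A minimal single-head sketch of that mechanism — illustrative only, not the paper's exact module (and MSCA is not reproduced here):

```python
import torch
import torch.nn.functional as F

def agent_attention(q, k, v, num_agents=49):
    """Single-head sketch of agent attention (ref [4]).
    q, k, v: (batch, n_tokens, dim); agents pooled from the queries is one
    common instantiation, assumed here."""
    b, n, d = q.shape
    agents = F.adaptive_avg_pool1d(q.transpose(1, 2), num_agents).transpose(1, 2)
    scale = d ** -0.5
    # Agents aggregate keys/values: (b, a, n) @ (b, n, d) -> (b, a, d)
    agent_kv = F.softmax(agents @ k.transpose(1, 2) * scale, dim=-1) @ v
    # Queries read from agents: (b, n, a) @ (b, a, d) -> (b, n, d)
    return F.softmax(q @ agents.transpose(1, 2) * scale, dim=-1) @ agent_kv
```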
| Layers | B1/% | B4/% | M/% | R/% | C/% | S/% |
|---|---|---|---|---|---|---|
| 1 | 76.3 | 37.7 | 28.0 | 57.8 | 120.1 | 21.2 |
| 2 | 78.3 | 37.6 | 28.9 | 57.8 | 123.1 | 21.9 |
| 3 | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
| 4 | 77.3 | 36.9 | 28.9 | 57.5 | 122.9 | 22.1 |
Tab. 5 Performance comparison of different encoder-decoder layers
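The depth sweep in Tab. 5 favors a 3-layer encoder-decoder stack. As a generic illustration of that knob only — the paper's layers use Agent Attention and MSCA, which this stock PyTorch stack does not reproduce:

```python
import torch.nn as nn

# Illustrative depth choice mirroring the best row of Tab. 5; dimensions
# are placeholders, not the paper's configuration.
d_model, n_heads, num_layers = 512, 8, 3
layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
```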
| Backbone | Image size | B1/% | B4/% | M/% | R/% | C/% | S/% |
|---|---|---|---|---|---|---|---|
| Swin-B | 384×384 | 76.7 | 36.4 | 28.8 | 57.2 | 121.4 | 21.8 |
| Swin-L | 384×384 | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
Tab. 6 Performance comparison of different backbone networks
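A hedged sketch of pulling hierarchical (multi-scale) feature maps from the Swin-L backbone compared in Tab. 6, using timm; the model name and feature taps are assumptions, not the paper's configuration:

```python
import timm
import torch

# Assumes a timm version whose Swin models support features_only=True;
# the checkpoint name below is timm's, not necessarily the paper's.
backbone = timm.create_model("swin_large_patch4_window12_384",
                             pretrained=True, features_only=True)
feats = backbone(torch.randn(1, 3, 384, 384))
for f in feats:
    print(f.shape)  # four stages with decreasing spatial resolution
```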
| DWC module | B1 | B4 | M | R | C | S |
|---|---|---|---|---|---|---|
| × | 77.1 | 36.1 | 28.5 | 57.0 | 120.2 | 21.6 |
| √ | 78.5 | 37.6 | 29.1 | 58.0 | 123.0 | 22.3 |
Tab. 7 Performance verification of encoder DWC module
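Reading "DWC" as a depth-wise convolution applied to the encoder tokens — an assumption, since this section does not expand the acronym — a minimal residual block might look like the following; the paper's actual design may differ:

```python
import torch
import torch.nn as nn

class DWConvBlock(nn.Module):
    """Assumed reading of the DWC module in Tab. 7: a depth-wise
    convolution injecting local structure into encoder tokens."""
    def __init__(self, dim: int):
        super().__init__()
        # groups=dim makes the convolution depth-wise (one filter per channel)
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, h, w):
        # x: (batch, h*w, dim) token sequence from the Swin encoder
        b, n, d = x.shape
        y = x.transpose(1, 2).reshape(b, d, h, w)
        y = self.dw(y).flatten(2).transpose(1, 2)
        return x + y  # residual connection
```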
| [1] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010. | 
| [2] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[EB/OL]. [2024-11-28]. | 
| [3] | LIU Z, LIN Y, CAO Y, et al. Swin Transformer: hierarchical vision Transformer using shifted windows[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 9992-10002. | 
| [4] | HAN D, YE T, HAN Y, et al. Agent attention: on the integration of Softmax and linear attention[C]// Proceedings of the 2024 European Conference on Computer Vision, LNCS 15108. Cham: Springer, 2025: 124-140. | 
| [5] | CHEN L, ZHANG H, XIAO J, et al. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6298-6306. | 
| [6] | ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086. | 
| [7] | PAN Y, YAO T, LI Y, et al. X-linear attention networks for image captioning[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10971-10980. | 
| [8] | CORNIA M, STEFANINI M, BARALDI L, et al. Meshed-memory Transformer for image captioning[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10575-10584. | 
| [9] | LUO Y, JI J, SUN X, et al. Dual-level collaborative Transformer for image captioning[C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 2286-2293. | 
| [10] | CHEN X, FANG H, LIN T Y, et al. Microsoft COCO captions: data collection and evaluation server[EB/OL]. [2024-11-28]. | 
| [11] | WU J, ZHENG H, ZHAO B, et al. Large-scale datasets for going deeper in image understanding[C]// Proceedings of the 2019 IEEE International Conference on Multimedia and Expo. Piscataway: IEEE, 2019: 1480-1485. | 
| [12] | KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3128-3137. | 
| [13] | PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]// Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2002: 311-318. | 
| [14] | DENKOWSKI M, LAVIE A. Meteor universal: language specific translation evaluation for any target language[C]// Proceedings of the 9th Workshop on Statistical Machine Translation. Stroudsburg: ACL, 2014: 376-380. | 
| [15] | LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]// Proceedings of the ACL-04 Workshop: Text Summarization Branches Out. Stroudsburg: ACL, 2004: 74-81. | 
| [16] | VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 4566-4575. | 
| [17] | ANDERSON P, FERNANDO B, JOHNSON M, et al. SPICE: semantic propositional image caption evaluation[C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9909. Cham: Springer, 2016: 382-398. | 
| [18] | RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1179-1195. | 
| [19] | YAO T, PAN Y, LI Y, et al. Exploring visual relationship for image captioning[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11218. Cham: Springer, 2018: 711-727. | 
| [20] | HUANG L, WANG W, CHEN J, et al. Attention on attention for image captioning[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 4633-4642. | 
| [21] | LIU M F, SHI Q, NIE L Q. Image captioning based on visual relevance and context dual attention[J]. Journal of Software, 2022, 33(9): 3210-3222. | 
| [22] | ZHANG X, SUN X, LUO Y, et al. RSTNet: captioning with adaptive attention on visual and non-visual words[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 15460-15469. | 
| [23] | WANG C, SHEN Y, JI L. Geometry attention Transformer with position-aware LSTMs for image captioning[J]. Expert Systems with Applications, 2022, 201: No.117174. | 
| [24] | FEI Z. Attention-aligned Transformer for image captioning[C]// Proceedings of the 36th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2022: 607-615. | 
| [25] | ZENG P, ZHANG H, SONG J, et al. S2 Transformer for image captioning[C]// Proceedings of the 31st International Joint Conference on Artificial Intelligence. California: ijcai.org, 2022: 1608-1614. | 
| [26] | MA Y, JI J, SUN X, et al. Towards local visual modeling for image captioning[J]. Pattern Recognition, 2023, 138: No.109420. | 
| [27] | LUO J, LI Y, PAN Y, et al. Semantic-conditional diffusion networks for image captioning[C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 23359-23368. | 
| [28] | DENG Z R, ZHANG Y L, YANG R, et al. BiGRU-RA model for image Chinese captioning via global and local features[J]. Journal of Computer-Aided Design and Computer Graphics, 2021, 33(1): 49-58. | 
| [29] | LIU M, HU H, LI L, et al. Chinese image caption generation via visual attention and topic modeling[J]. IEEE Transactions on Cybernetics, 2022, 52(2): 1247-1257. | 
| [30] | LU H, YANG R, DENG Z, et al. Chinese image captioning via fuzzy attention-based DenseNet-BiLSTM[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2021, 17(1s): No.14. | 
| [31] | PAN Y, WANG L, DUAN S, et al. Chinese image caption of Inceptionv4 and double-layer GRUs based on attention mechanism[J]. Journal of Physics: Conference Series, 2021, 1861: No.012044. | 
| [32] | HODOSH M, YOUNG P, HOCKENMAIER J. Framing image description as a ranking task: data, models and evaluation metrics[J]. Journal of Artificial Intelligence Research, 2013, 47: 853-899. | 
| [33] | KATIYAR S, BORGOHAIN S K. Analysis of convolutional decoder for image caption generation[EB/OL]. [2024-11-28]. | 
| [34] | LI X, YUAN A, LU X. Multi-modal gated recurrent units for image description[J]. Multimedia Tools and Applications, 2018, 77(22): 29847-29869. | 