《计算机应用》 2025, Vol. 45, Issue (10): 3154-3160. DOI: 10.11772/j.issn.1001-9081.2024101478

• 人工智能 •

基于Swin Transformer与多尺度特征融合的图像描述方法

王子怡1, 李卫军1,2, 刘雪洋1, 丁建平1, 刘世侠1, 苏易礌1

  1. 北方民族大学 计算机科学与工程学院,银川 750021
    2. 图形图像智能处理国家民委重点实验室(北方民族大学),银川 750021
  • 收稿日期:2024-10-22 修回日期:2024-12-03 接受日期:2024-12-09 发布日期:2024-12-17 出版日期:2025-10-10
  • 通讯作者: 李卫军
  • 作者简介:王子怡(2001—),女,山东泰安人,硕士研究生,CCF会员,主要研究方向:图像描述、自然语言处理;
    李卫军(1979—),男,陕西渭南人,讲师,博士,CCF会员,主要研究方向:本体的构建与重用、知识图谱的构建,Email:lwj@nmu.edu.cn;
    刘雪洋(1999—),女,河南南阳人,硕士研究生,CCF会员,主要研究方向:知识图谱推理;
    丁建平(1999—),男,四川资阳人,硕士研究生,CCF会员,主要研究方向:命名实体识别;
    刘世侠(2000—),男(壮族),广西贵港人,硕士研究生,CCF会员,主要研究方向:强化学习、知识推理;
    苏易礌(2000—),男,湖南常德人,硕士研究生,CCF会员,主要研究方向:文本分类。
  • 基金资助:
    国家自然科学基金资助项目(62066038);国家自然科学基金资助项目(61962001);中央高校基本科研业务费专项资金资助项目(2019KYQD04,2022PT_S04,2021JCYJ12);宁夏自然科学基金资助项目(2021AAC03215)

Image caption method based on Swin Transformer and multi-scale feature fusion

Ziyi WANG1, Weijun LI1,2, Xueyang LIU1, Jianping DING1, Shixia LIU1, Yilei SU1

  1. School of Computer Science and Engineering, North Minzu University, Yinchuan Ningxia 750021, China
    2. The Key Laboratory of Images and Graphics Intelligent Processing of State Ethnic Affairs Commission (North Minzu University), Yinchuan Ningxia 750021, China
  • Received:2024-10-22 Revised:2024-12-03 Accepted:2024-12-09 Online:2024-12-17 Published:2025-10-10
  • Contact: Weijun LI
  • About author:WANG Ziyi, born in 2001, M. S. candidate. Her research interests include image caption, natural language processing.
    LI Weijun, born in 1979, Ph. D., lecturer. His research interests include construction and reuse of ontology, construction of knowledge graph.
    LIU Xueyang, born in 1999, M. S. candidate. Her research interests include knowledge graph reasoning.
    DING Jianping, born in 1999, M. S. candidate. His research interests include named entity recognition.
    LIU Shixia, born in 2000, M. S. candidate. His research interests include reinforcement learning, knowledge reasoning.
    SU Yilei, born in 2000, M. S. candidate. His research interests include text classification.
  • Supported by:
    National Natural Science Foundation of China (62066038, 61962001); Ningxia Natural Science Foundation (2021AAC03215); Fundamental Research Funds for the Central Universities (2019KYQD04, 2022PT_S04, 2021JCYJ12)

摘要:

基于Transformer的图像描述方法通过多头注意力在整个输入序列上计算注意力权重,缺乏层次化的特征提取能力,并且两阶段的图像描述方法限制了模型性能。针对上述问题,提出一种基于Swin Transformer与多尺度特征融合的图像描述方法(STMSF)。在编码器中通过Agent Attention保持全局上下文建模能力的同时,提高计算效率;在解码器中提出多尺度交叉注意力(MSCA),融合交叉注意力与深度可分离卷积,在得到多尺度特征的同时更充分地融合多模态特征。实验结果表明,在MSCOCO数据集上与SCD-Net(Semantic-Conditional Diffusion Network)方法相比,STMSF的BLEU4(BiLingual Evaluation Understudy with 4-grams)和CIDEr(Consensus-based Image Description Evaluation)指标分别提升了1.1和5.3个百分点。对比实验和消融实验的结果表明,所提的一阶段STMSF能够有效提高模型性能,生成高质量的图像描述语句。
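
下面给出一个假设性的最小示意代码,用于说明摘要中"以Agent Attention在保持全局上下文建模能力的同时降低注意力计算量"的基本思路:先由查询池化出少量agent令牌,agent令牌聚合全局键/值信息后再广播回全部查询,复杂度由O(N²·d)降为约O(N·n_agent·d)。其中的类名SimpleAgentAttention、agent令牌数和池化方式均为示意性假设,并非论文的官方实现。

```python
# 假设性的单头Agent Attention草图,仅用于说明思路,非论文实现
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleAgentAttention(nn.Module):
    """agent令牌先聚合全局K/V,再把聚合结果广播回所有查询位置。"""
    def __init__(self, dim: int, n_agent: int = 49):
        super().__init__()
        self.scale = dim ** -0.5
        self.n_agent = n_agent
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C),N为视觉令牌数
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # 通过自适应平均池化从查询中得到agent令牌: (B, n_agent, C)
        agent = F.adaptive_avg_pool1d(q.transpose(1, 2), self.n_agent).transpose(1, 2)
        # 第一步: agent令牌对全局K/V做注意力,聚合全局信息
        agent_v = (agent @ k.transpose(1, 2) * self.scale).softmax(dim=-1) @ v
        # 第二步: 查询只与agent令牌交互,把聚合信息广播回每个位置
        out = (q @ agent.transpose(1, 2) * self.scale).softmax(dim=-1) @ agent_v
        return self.proj(out)
```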

关键词: Swin Transformer, 多尺度特征, 特征融合, 图像描述, 深度可分离卷积

Abstract:

Transformer-based image caption methods compute attention weights over the entire input sequence through multi-head attention and lack hierarchical feature extraction capability; in addition, two-stage image caption methods limit model performance. To address the above issues, an image caption method based on Swin Transformer and Multi-Scale feature Fusion (STMSF) was proposed. In the encoder of this method, Agent Attention was used to maintain global context modeling capability while improving computational efficiency. In the decoder of this method, Multi-Scale Cross Attention (MSCA) was proposed to combine cross-attention with depthwise separable convolution, so that multi-scale features were obtained and multi-modal features were fused more fully. Experimental results on the MSCOCO dataset show that compared with the SCD-Net (Semantic-Conditional Diffusion Network) method, STMSF improves the BLEU4 (BiLingual Evaluation Understudy with 4-grams) and CIDEr (Consensus-based Image Description Evaluation) metrics by 1.1 and 5.3 percentage points, respectively. Results of the comparison and ablation experiments show that the proposed single-stage STMSF can improve model performance effectively and generate high-quality image caption sentences.
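
As a rough illustration of the MSCA idea summarized above (cross-attention between caption tokens and encoder visual features, followed by depthwise separable convolutions at several scales whose outputs are fused), a minimal PyTorch sketch is given below. The class names, kernel sizes, and the concatenation-based fusion are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of a multi-scale cross-attention block; not the paper's code.
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """Depthwise (per-channel) convolution followed by a pointwise 1x1 convolution."""
    def __init__(self, dim: int, kernel_size: int):
        super().__init__()
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, L)
        return self.pointwise(self.depthwise(x))

class MultiScaleCrossAttentionSketch(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.branches = nn.ModuleList(
            DepthwiseSeparableConv1d(dim, k) for k in kernel_sizes)
        self.fuse = nn.Linear(dim * len(kernel_sizes), dim)

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text: (B, T, C) caption-token queries; visual: (B, N, C) encoder features
        attended, _ = self.cross_attn(text, visual, visual)
        # Depthwise separable branches over the token dimension: Conv1d expects (B, C, T)
        feats = [branch(attended.transpose(1, 2)).transpose(1, 2)
                 for branch in self.branches]
        # Concatenate the multi-scale outputs and project back to the model dimension
        return self.fuse(torch.cat(feats, dim=-1))
```

For example, with `dim=512`, a batch of 20-token captions and 196 visual tokens, `MultiScaleCrossAttentionSketch(512)(text, visual)` returns a fused tensor of shape (B, 20, 512).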

Key words: Swin Transformer, multi-scale feature, feature fusion, image caption, depthwise separable convolution

中图分类号: