《计算机应用》(Journal of Computer Applications) official website ›› 2026, Vol. 46 ›› Issue (3): 741-749. DOI: 10.11772/j.issn.1001-9081.2025040414

• Artificial Intelligence •


Remote sensing image captioning model combining dense multi-scale feature fusion and feature knowledge-enhanced Transformer

Hanqing LIU, Guoming SANG(), Yijia ZHANG   

  1. Information Science and Technology College, Dalian Maritime University, Dalian, Liaoning 116026, China
  • Received: 2025-04-18 Revised: 2025-06-25 Accepted: 2025-06-26 Online: 2025-07-03 Published: 2026-03-10
  • Contact: Guoming SANG
  • About author: LIU Hanqing, born in 2001, M. S. candidate, CCF member. His research interests include remote sensing image captioning and multimodal learning.
    ZHANG Yijia, born in 1979, Ph. D., professor. His research interests include natural language processing and social media computing.
  • Supported by:
    National Natural Science Foundation of China(62072070)


Abstract:

To address the challenges of insufficient multi-scale feature utilization, low inter-region detail correlation in texture-repetitive areas, and difficulty in collaborative modeling of multi-target features in remote sensing image captioning tasks, a remote sensing image captioning model combining dense multi-scale feature fusion and a feature knowledge-enhanced Transformer, named DMFKF-T (Dense Multi-scale Feature and Knowledge Fusion Transformer), was proposed. A Dense Multi-scale Feature Fusion Module (DMFFM) was designed to aggregate feature maps of different scales dynamically through cross-layer skip connections, thereby capturing global scene features and local detail information simultaneously. In the decoding stage, a Semantic Fusion Amplifier (SFA) module was introduced to enhance the model's ability to capture long-range dependencies and comprehend contextual information, and a frequency-enhanced channel attention mechanism based on the Discrete Cosine Transform (DCT) was incorporated to analyze the correlations of frequency-domain features, thereby strengthening the modeling of complex spatial topologies and nonlinear relationships. Experimental results show that on the Remote Sensing Image Captioning Dataset (RSICD), compared with the SD-RSIC (Summarization-driven Deep Remote Sensing Image Captioning) model, DMFKF-T improves the BLEU-4 (BiLingual Evaluation Understudy with 4-grams) and CIDEr (Consensus-based Image Description Evaluation) metrics by 8.6% and 14.4%, respectively. These results demonstrate that DMFKF-T can accurately generate semantically rich captions for remote sensing images.
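The cross-layer aggregation idea behind the DMFFM can be illustrated with a minimal NumPy sketch: feature maps from different encoder stages are resized to a common resolution via skip connections and fused so that every scale contributes to the output. This is an illustrative toy, not the paper's implementation; `resize_nn` and `dense_fuse` are hypothetical names, and the learned fusion weights are replaced by a simple average.

```python
import numpy as np

def resize_nn(x, out_h, out_w):
    # Nearest-neighbour resize of a (C, H, W) feature map to (C, out_h, out_w)
    C, H, W = x.shape
    ri = np.arange(out_h) * H // out_h
    rj = np.arange(out_w) * W // out_w
    return x[:, ri][:, :, rj]

def dense_fuse(features, out_hw):
    # Cross-layer skip connections: every scale is resized to the target
    # resolution and contributes to the fused map (average in place of
    # learned fusion weights).
    resized = [resize_nn(f, *out_hw) for f in features]
    return sum(resized) / len(resized)

# Toy multi-scale pyramid: 8x8, 4x4 and 2x2 feature maps with 2 channels
pyramid = [np.ones((2, 8, 8)), 2 * np.ones((2, 4, 4)), 3 * np.ones((2, 2, 2))]
fused = dense_fuse(pyramid, (8, 8))   # (2, 8, 8), constant value 2.0
```

A real DMFFM would use learned convolutions and dense connections between all stage pairs, but the resize-then-aggregate pattern is the core of any multi-scale fusion of this kind.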

Key words: dense multi-scale feature fusion, Semantic Fusion Amplifier (SFA), frequency-enhanced channel attention, feature knowledge-enhanced Transformer, remote sensing image captioning
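The "frequency-enhanced channel attention" listed above can also be sketched in a few lines: each channel is pooled with a 2D DCT-II basis function instead of plain average pooling, and the resulting frequency descriptor gates the channels. This is a hedged toy sketch under assumed details (the excitation MLP is collapsed into a single sigmoid, and frequency assignment is round-robin); it is not the authors' code. Note that the lowest frequency (0, 0) reduces exactly to (scaled) global average pooling, which is why DCT pooling generalizes the usual squeeze step.

```python
import numpy as np

def dct2_basis(h, w, u, v):
    # 2D DCT-II basis function for frequency pair (u, v) on an h-by-w grid
    i = np.arange(h)[:, None]
    j = np.arange(w)[None, :]
    return (np.cos(np.pi * (2 * i + 1) * u / (2 * h))
            * np.cos(np.pi * (2 * j + 1) * v / (2 * w)))

def dct_channel_descriptor(x, freqs):
    # Squeeze step: pool each channel of x (C, H, W) with its assigned
    # DCT frequency (round-robin over the freqs list).
    C, H, W = x.shape
    return np.array([(x[c] * dct2_basis(H, W, *freqs[c % len(freqs)])).sum()
                     for c in range(C)])

def freq_channel_attention(x, freqs):
    # Excite step: gate channels with a sigmoid of the normalized descriptor
    # (a stand-in for the learned two-layer MLP used in practice).
    C, H, W = x.shape
    gate = 1.0 / (1.0 + np.exp(-dct_channel_descriptor(x, freqs) / (H * W)))
    return x * gate[:, None, None]
```

With `freqs = [(0, 0)]` the descriptor of each channel equals its spatial sum, i.e. H*W times its mean, confirming the average-pooling special case.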

CLC number: