

Global image caption generation method based on graph attention network

Jiahong Sui1, Yingchi Mao1,2*, Huimin Yu1, Zicheng Wang3, Ping Ping1,2

  1. College of Computer and Information, Hohai University, Nanjing 210098, China
  2. Key Laboratory of Water Big Data Technology of the Ministry of Water Resources (Hohai University), Nanjing 210098, China
  3. PowerChina Kunming Engineering Corporation Limited, Kunming 650051, China

  • Received: 2022-04-15  Revised: 2022-07-14  Accepted: 2022-07-14  Online: 2022-08-12  Published: 2022-08-12
  • Corresponding author: Yingchi Mao
  • Supported by:
    National Natural Science Foundation of China; Key Research and Development Program of Jiangsu Province; Science and Technology Project of China Huaneng Group Headquarters

Global image caption method based on graph attention network

  • Received:2022-04-15 Revised:2022-07-14 Accepted:2022-07-14 Online:2022-08-12 Published:2022-08-12
  • Contact: Yingchi Mao
  • Supported by:
    National Natural Science Foundation of China Key Program; Key R&D Program of Jiangsu Province; Key Project of China Huaneng Group

摘要 (Abstract): Existing image caption generation methods consider only the spatial location features of grids, with insufficient interaction among grid features, and they do not make full use of the global features of the image. To address this, a global image caption generation method based on the Graph ATtention network (GAT) was proposed to generate higher-quality image captions. First, a multi-layer Convolutional Neural Network (CNN) was used for image encoding to extract the grid features and whole-image features of a given image and to build a grid feature interaction graph. Then, the graph attention network converted the feature extraction problem into a node classification problem with one global node and multiple local nodes, so that both global and local features could be fully exploited after updating and optimization. Finally, a Transformer-based decoding module used the refined visual features to generate image captions. Experiments and evaluation were carried out on the Microsoft COCO dataset; the results show that the proposed method effectively captures the global and local features of images and reaches 133.1% on the CIDEr metric. The proposed method effectively improves the accuracy of textual image descriptions, so that text can be used for processing tasks such as image classification, retrieval, and analysis.
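To make the encoding step concrete, below is a minimal sketch (not the authors' released code) of a single graph-attention update over one global node and the CNN grid nodes, written in PyTorch. The class name GlobalGridGAT, the feature dimension, the mean-pooled global node, and the fully connected adjacency are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalGridGAT(nn.Module):
    """Single graph-attention layer over one global node and N grid nodes (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)    # shared node projection
        self.attn = nn.Linear(2 * dim, 1, bias=False)  # scores concatenated node pairs

    def forward(self, nodes, adj):
        # nodes: (N+1, dim) -- row 0 is the global node, rows 1..N are grid features
        # adj:   (N+1, N+1) -- 1 where an edge exists in the interaction graph, else 0
        h = self.proj(nodes)
        n = h.size(0)
        # score every ordered node pair [h_i || h_j], then mask out non-edges
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = F.leaky_relu(self.attn(pair).squeeze(-1))
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=-1)          # attention over each node's neighbours
        return F.elu(alpha @ h)                        # updated (N+1, dim) node features

# Example: a 7x7 grid of CNN features plus one mean-pooled global node,
# with every node connected to every other node (assumed adjacency).
grid = torch.randn(49, 512)
nodes = torch.cat([grid.mean(dim=0, keepdim=True), grid], dim=0)  # (50, 512)
adj = torch.ones(50, 50)
refined = GlobalGridGAT(512)(nodes, adj)
```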

关键词 (Key words): grid feature, graph attention network, convolutional neural network, image caption generation, global feature

Abstract: Since existing image caption approaches only focused on the spatial location features of grids, without considering grid feature interaction or the global features of the image, an image caption method based on Graph ATtention network (GAT) with global context was proposed to generate higher-quality image captions. First, a multi-layer Convolutional Neural Network (CNN) was utilized for visual encoding, the grid features and whole-image features of a given image were extracted, and a grid feature interaction graph was built. Then, through the graph attention network, which included one global node and multiple local nodes, the feature extraction problem was converted into a node classification problem, so that the global and local features could be fully utilized after updating and optimization. Finally, the Transformer-based decoding module made use of the refined visual features to generate image captions. Experiments and evaluation were conducted on the Microsoft COCO dataset. The experimental results demonstrate that the proposed method successfully captures the global and local features of the image and achieves 133.1% on the CIDEr (Consensus-based Image Description Evaluation) metric. The proposed method effectively improves the accuracy of image captioning, which makes it possible to perform processing tasks such as classification, retrieval, and analysis of images through text.
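As a companion sketch for the decoding step described above, the snippet below feeds GAT-refined node features to a standard PyTorch TransformerDecoder as cross-attention memory. The class name CaptionDecoder, the vocabulary size, and the layer counts are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Transformer decoder that cross-attends to the refined visual nodes (illustrative)."""
    def __init__(self, vocab_size, dim=512, heads=8, layers=3, max_len=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(block, num_layers=layers)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens, visual_nodes):
        # tokens:       (B, T)        word ids generated so far
        # visual_nodes: (B, N+1, dim) refined global + grid features from the encoder
        T = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        causal = torch.triu(torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        h = self.decoder(x, visual_nodes, tgt_mask=causal)  # masked self-attn + cross-attn
        return self.out(h)                                   # (B, T, vocab_size) word logits

# Example forward pass with a hypothetical 10,000-word vocabulary.
decoder = CaptionDecoder(vocab_size=10000)
logits = decoder(torch.randint(0, 10000, (2, 12)), torch.randn(2, 50, 512))
```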

Key words: grid feature, Graph ATtention network (GAT), Convolutional Neural Network (CNN), image captioning, global feature

CLC number: