Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (5): 1409-1415. DOI: 10.11772/j.issn.1001-9081.2022040513

• Artificial intelligence •

Global image captioning method based on graph attention network

Jiahong SUI1, Yingchi MAO1,2, Huimin YU1, Zicheng WANG3, Ping PING1,2

  1. College of Computer and Information, Hohai University, Nanjing Jiangsu 210098, China
    2. Key Laboratory of Water Big Data Technology of Ministry of Water Resources (Hohai University), Nanjing Jiangsu 210098, China
    3. Power China Kunming Engineering Corporation Limited, Kunming Yunnan 650051, China
  • Received: 2022-04-05; Revised: 2022-07-11; Accepted: 2022-07-14; Online: 2023-05-08; Published: 2023-05-10
  • Contact: Yingchi MAO, yingchimao@hhu.edu.cn
  • About author:SUI Jiahong, born in 1998, M. S. candidate. Her research interests include computer vision.
    MAO Yingchi, born in 1976, Ph. D., professor. Her research interests include edge intelligent computing.
    YU Huimin, born in 1998, M. S. candidate. Her research interests include computer vision.
    WANG Zicheng, born in 1990, M. S., engineer. His research interests include digital image processing and three-dimensional modeling.
    PING Ping, born in 1982, Ph. D., associate professor. Her research interests include digital image processing.
  • Supported by:
    National Natural Science Foundation of China (61902110); Key Research and Development Program of Jiangsu Province (BE2020729); Science and Technology Project of China Huaneng Group Headquarters (HNKJ19-H12)


Abstract:

Existing image captioning methods consider only the spatial locations of grid features, with insufficient interaction among grid features and incomplete use of global image features. To generate higher-quality image captions, a global image captioning method based on Graph ATtention network (GAT) was proposed. Firstly, a multi-layer Convolutional Neural Network (CNN) was used for visual encoding to extract the grid features and the whole-image feature of a given image and to build a grid feature interaction graph. Then, GAT was used to transform the feature extraction problem into a node classification problem over a graph consisting of one global node and many local nodes, so that both global and local features could be fully exploited after node updating and optimization. Finally, a Transformer-based decoding module used the refined visual features to generate the image caption. Experimental results on the Microsoft COCO dataset demonstrate that the proposed method effectively captures the global and local features of an image, achieving a CIDEr (Consensus-based Image Description Evaluation) score of 133.1%. It can be seen that the proposed method effectively improves the accuracy of image captioning, thereby enabling text-based processing of images such as classification, retrieval, and analysis.
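As a rough illustration of the encoding idea described above, the sketch below treats the CNN grid features plus one global image feature as graph nodes and refines them with a single GAT-style attention layer. It is not the authors' implementation: the module names, feature dimensions, the mean-pooled global feature, and the fully connected adjacency are all illustrative assumptions.

# Minimal sketch (illustrative assumptions, not the paper's code): CNN grid
# features and one global image feature are refined together by a single
# graph-attention layer before being passed to a Transformer decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Single-head GAT-style layer over (1 global + N local) nodes."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)  # scores concatenated node pairs

    def forward(self, nodes, adj):
        # nodes: (B, N, D) node features; adj: (N, N) 0/1 adjacency mask
        h = self.proj(nodes)                                    # (B, N, D)
        B, N, D = h.shape
        hi = h.unsqueeze(2).expand(B, N, N, D)                  # target node, repeated
        hj = h.unsqueeze(1).expand(B, N, N, D)                  # neighbour node, repeated
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], -1))).squeeze(-1)  # (B, N, N)
        e = e.masked_fill(adj == 0, float('-inf'))              # keep only graph edges
        alpha = torch.softmax(e, dim=-1)                        # attention over neighbours
        return F.elu(torch.bmm(alpha, h))                       # aggregated node features

# Toy usage: 7x7 grid of CNN features plus one global node, fully connected graph.
B, G, D = 2, 7 * 7, 512
grid = torch.randn(B, G, D)                    # stands in for CNN grid features
global_feat = grid.mean(dim=1, keepdim=True)   # stands in for the whole-image feature
nodes = torch.cat([global_feat, grid], dim=1)  # node 0 = global, nodes 1..G = local
adj = torch.ones(G + 1, G + 1)                 # assumed fully connected interaction graph
refined = GraphAttentionLayer(D)(nodes, adj)   # refined features for the decoder
print(refined.shape)                           # torch.Size([2, 50, 512])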

Key words: grid feature, Graph ATtention network (GAT), Convolutional Neural Network (CNN), image captioning, global feature


CLC Number: