Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (9): 2504-2509.DOI: 10.11772/j.issn.1001-9081.2020111838

Special Issue: Artificial Intelligence

• Artificial intelligence •

Multi-layer encoding and decoding model for image captioning based on attention mechanism

LI Kangkang, ZHANG Jing   

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received:2020-11-23 Revised:2021-02-21 Online:2021-09-10 Published:2021-05-12
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61402174).


LI Kangkang, ZHANG Jing

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Corresponding author: ZHANG Jing
  • About the authors: LI Kangkang (born 1995), male (Mongolian ethnicity), from Luoyang, Henan, is an M.S. candidate and a CCF member; his research interests include computer vision and image captioning. ZHANG Jing (born 1978), female, from Sanmenxia, Henan, is an associate professor with a Ph.D. and a CCF member; her research interests include deep learning, computer vision, image retrieval, image captioning, and visual question answering.

Abstract: Image captioning is an important branch of image understanding. It requires not only correctly recognizing the image content, but also generating grammatically and semantically correct sentences. Traditional encoder-decoder models cannot make full use of image features and rely on a single decoding method. To address these problems, an attention-based Multi-Layer Encoding and Decoding model for image captioning, named MLED, was proposed. Firstly, Faster Region-based Convolutional Neural Network (Faster R-CNN) was used to extract image features. Then, a Transformer was employed to extract three kinds of high-level features from the image, and a pyramid fusion method was used to fuse these features effectively. Finally, three Long Short-Term Memory (LSTM) networks were constructed to decode the features of different layers hierarchically. In the decoding part, a soft attention mechanism was used to let the model focus on the important information required at the current step. The proposed model was tested on the MSCOCO dataset and evaluated with BLEU, METEOR, ROUGE-L, and CIDEr. Experimental results show that on BLEU-4, METEOR, and CIDEr, the model outperforms the Recall what you see (Recall) model by 2.5, 2.6, and 8.8 percentage points respectively, and the Hierarchical Attention-based Fusion (HAF) model by 1.2, 0.5, and 3.5 percentage points respectively. Visualization of the generated sentences shows that the descriptions produced by the proposed model accurately reflect the image content.
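The soft attention mechanism mentioned in the abstract scores each encoded image region against the decoder's current hidden state, normalizes the scores with a softmax, and forms a weighted context vector for the next word prediction. The following is a minimal NumPy sketch of that step only; all variable names, weight matrices, and dimensions are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(features, hidden, W_f, W_h, w):
    """One step of additive soft attention (hypothetical parameterization).

    features: (k, d) array of k region features from the encoder
    hidden:   (n,) decoder hidden state at the current step
    W_f, W_h, w: projection parameters mapping features and state
                 into a shared attention space of size a
    Returns the context vector (d,) and the attention weights (k,).
    """
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w  # (k,) relevance scores
    alpha = softmax(scores)        # attention weights, non-negative, sum to 1
    context = alpha @ features     # (d,) weighted sum of region features
    return context, alpha

# Toy example with random data (dimensions are arbitrary).
rng = np.random.default_rng(0)
k, d, n, a = 5, 8, 6, 4
features = rng.standard_normal((k, d))
hidden = rng.standard_normal(n)
W_f = rng.standard_normal((d, a))
W_h = rng.standard_normal((n, a))
w = rng.standard_normal(a)

context, alpha = soft_attention(features, hidden, W_f, W_h, w)
```

In the full model this step would run once per decoding time step inside each of the three LSTM decoders, with `hidden` taken from the corresponding LSTM.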

Key words: image captioning, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) network, multi-layer encoding, multi-layer decoding, attention mechanism


