Multi-layer encoding and decoding model for image captioning based on attention mechanism

doi:10.11772/j.issn.1001-9081.2020111838

Abstract

Image captioning is an important branch of image understanding. It requires not only correct recognition of the image content but also the generation of grammatically and semantically correct sentences. Traditional encoder-decoder models cannot make full use of image features and rely on a single decoding method. To address these problems, a multi-layer encoding and decoding model for image captioning based on the attention mechanism, named MLED, was proposed. Firstly, a Faster Region-based Convolutional Neural Network (Faster R-CNN) was used to extract image features. Then, a Transformer was employed to extract three kinds of high-level image features, and a pyramid fusion method was used to fuse these features effectively. Finally, three Long Short-Term Memory (LSTM) networks were constructed to decode the features of the different layers hierarchically. In the decoding part, a soft attention mechanism enabled the model to attend to the important information required at the current step. The proposed model was tested on the MSCOCO dataset and evaluated with BLEU, METEOR, ROUGE-L and CIDEr. Experimental results show that on BLEU-4, METEOR and CIDEr the model improves on the Recall what you see (Recall) model by 2.5, 2.6 and 8.8 percentage points respectively, and on the Hierarchical Attention-based Fusion (HAF) model by 1.2, 0.5 and 3.5 percentage points respectively. Visualization of the generated captions shows that the sentences produced by the proposed model accurately reflect the image content.
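The soft attention step mentioned in the abstract can be illustrated with a minimal sketch. This is a generic additive soft attention over region features, not the paper's exact formulation: the function name `soft_attention`, the parameter names `W_f`, `W_h` and `w`, and all dimensions below are assumptions chosen for illustration, and the random arrays stand in for Faster R-CNN region features and an LSTM hidden state.

```python
import numpy as np


def soft_attention(features, hidden, W_f, W_h, w):
    """Generic additive soft attention (illustrative, not the paper's exact form).

    features: (k, d) array, one d-dimensional vector per image region
    hidden:   (h,) array, the decoder state at the current step
    W_f (d, a), W_h (h, a), w (a,): hypothetical learned parameters
    Returns the attention weights (k,) and the context vector (d,).
    """
    # Score each region against the current decoder state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w  # shape (k,)
    # Softmax, shifted by the maximum for numerical stability.
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    # The context vector is the weighted average of the region features.
    context = weights @ features  # shape (d,)
    return weights, context


# Toy sizes (assumptions): 36 regions, 8-dim features, 6-dim hidden state.
rng = np.random.default_rng(0)
k, d, h, a = 36, 8, 6, 5
features = rng.normal(size=(k, d))
hidden = rng.normal(size=h)
W_f = rng.normal(size=(d, a))
W_h = rng.normal(size=(h, a))
w = rng.normal(size=a)

weights, context = soft_attention(features, hidden, W_f, W_h, w)
```

Because the weights are non-negative and sum to one, the context vector is a convex combination of the region features, which is what lets the decoder focus on the regions most relevant to the word being generated at the current step.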

Key words: image captioning, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) network, multi-layer encoding, multi-layer decoding, attention mechanism

CLC Number: TP391

LI Kangkang, ZHANG Jing. Multi-layer encoding and decoding model for image captioning based on attention mechanism[J]. Journal of Computer Applications, 2021, 41(9): 2504-2509.
