[1] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3156-3164.
[2] LI W H, ZENG S Y, WANG J J. Image description generation algorithm based on improved attention mechanism[J]. Journal of Computer Applications, 2021, 41(5): 1262-1267. (in Chinese)
[3] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2015-04-10)[2021-01-20]. https://arxiv.org/pdf/1409.1556.pdf.
[4] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
[5] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[6] BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate[EB/OL]. (2016-05-19)[2019-09-01]. https://arxiv.org/pdf/1409.0473.pdf.
[7] XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning. New York: JMLR.org, 2015: 2048-2057.
[8] CHEN L, ZHANG H W, XIAO J, et al. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6298-6306.
[9] XIAO X Y, WANG L F, DING K, et al. Dense semantic embedding network for image captioning[J]. Pattern Recognition, 2019, 90: 285-296.
[10] ZHANG M X, YANG Y, ZHANG H W, et al. More is better: precise and detailed image captioning using online positive recall and missing concepts mining[J]. IEEE Transactions on Image Processing, 2019, 28(1): 32-44.
[11] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[12] LU J S, XIONG C M, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 3242-3250.
[13] ZHAO X H, LI X. Image captioning algorithm based on multi-feature extraction[J]. Journal of Computer Applications, 2021, 41(6): 1640-1646. (in Chinese)
[14] XIAO X Y, WANG L F, DING K, et al. Deep hierarchical encoder-decoder network for image captioning[J]. IEEE Transactions on Multimedia, 2019, 21(11): 2942-2956.
[15] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010.
[16] LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 936-944.
[17] ZHANG Z J, WU Q, WANG Y, et al. High-quality image captioning with fine-grained and semantic-guided visual attention[J]. IEEE Transactions on Multimedia, 2019, 21(7): 1681-1693.
[18] YAO T, PAN Y W, LI Y H, et al. Boosting image captioning with attributes[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 4904-4912.
[19] WU Q, SHEN C H, WANG P, et al. Image captioning and visual question answering based on attributes and external knowledge[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(6): 1367-1381.
[20] YU N G, HU X L, SONG B H, et al. Topic-oriented image captioning based on order-embedding[J]. IEEE Transactions on Image Processing, 2019, 28(6): 2743-2754.
[21] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086.
[22] WU L X, XU M, WANG J Q, et al. Recall what you see continually using GridLSTM in image captioning[J]. IEEE Transactions on Multimedia, 2020, 22(3): 808-818.
[23] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the 2014 European Conference on Computer Vision, LNCS 8693. Cham: Springer, 2014: 740-755.
[24] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2002: 311-318.
[25] BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments[C]//Proceedings of the 2005 ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg, PA: Association for Computational Linguistics, 2005: 65-72.
[26] LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]//Proceedings of the ACL 2004 Workshop on Text Summarization. Stroudsburg, PA: Association for Computational Linguistics, 2004: 74-81.
[27] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 4566-4575.
[28] KINGMA D P, BA J M. Adam: a method for stochastic optimization[EB/OL]. (2017-01-30)[2020-04-22]. https://arxiv.org/pdf/1412.6980.pdf.
[29] WEI R Y, MENG Z Q. Image caption model based on attention feature adaptive recalibration[J]. Journal of Computer Applications, 2020, 40(S1): 45-50. (in Chinese)
[30] WU C L, YUAN S Z, CAO H W, et al. Hierarchical attention-based fusion for image caption with multi-grained rewards[J]. IEEE Access, 2020, 8: 57943-57951.