Image description generation algorithm based on improved attention mechanism

doi:10.11772/j.issn.1001-9081.2020071078

Abstract

Abstract: Image description is to express the global information contained in the image in sentences. It requires that the image description generation model can extract image information and express the extracted image information in sentences. The traditional model is based on Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN), which can realize the function of image-to-sentence translation to a certain extent. However, this model has low accuracy and training speed when extracting key information of the image. To solve this problem, an improved attention mechanism image description generation model based on CNN and Long Short-Term Memory (LSTM) network was proposed. VGG19 and ResNet101 were used as the feature extraction networks, and group convolution was introduced into the attention mechanism to replace the traditional fully connected operation, so as to improve the evaluation indices.The model was trained by public datasets Flickr8K and Flickr30K and validated by various evaluation indices (BLEU(Bilingual Evaluation Understudy), ROUGE_L(Recall-Oriented Understudy for Gisting Evaluation), CIDEr(Consensus-based Image Description Evaluation), METEOR(Metric for Evaluation of Translation with Explicit Ordering)). Experimental results show that compared with the model with traditional attention mechanism, the proposed improved image description generation model with attention mechanism improves the accuracy of the image description task, and this model is better than the traditional model on all the four evaluation indices.

Key words: image description, natural language processing, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) network, attention mechanism

摘要： 图像描述是将图像所包含的全局信息用语句来表示。它要求图像描述生成模型既能提取出图像信息，又能将提取出来的图像信息用语句表达出来。传统的模型是基于卷积神经网络（CNN）和循环神经网络（RNN）搭建的，在一定程度上可以实现图像转语句的功能，但该模型在提取图像关键信息时精度不高且训练速度缓慢。针对这一问题，提出了一种基于CNN和长短期记忆（LSTM）网络改进的注意力机制图像描述生成模型。采用VGG19和ResNet101作为特征提取网络，在注意力机制中引入分组卷积替代传统的全连接操作，从而提高评价值指标。使用了公共数据集Flickr8K、Flickr30K对该模型进行训练，采用多种评价指标（BLEU、ROUGE_L、CIDEr、METEOR）对模型进行验证。实验结果表明，与引入传统的注意力机制模型相比，提出的改进注意力机制图像描述生成模型对图像描述任务的准确性有所提升，并且该模型在5种评价指标上均优于传统的模型。

关键词: 图像描述, 自然语言处理, 卷积神经网络, 长短期记忆网络, 注意力机制

CLC Number:

TP391

LI Wenhui, ZENG Shangyou, WANG Jinjin. Image description generation algorithm based on improved attention mechanism[J]. Journal of Computer Applications, 2021, 41(5): 1262-1267.

李文惠, 曾上游, 王金金. 基于改进注意力机制的图像描述生成算法[J]. 计算机应用, 2021, 41(5): 1262-1267.

References

[1] VINYALS O,TOSHEV A,BENGIO S,et al. Show and tell:a neural image caption generator[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE,2015:3156-3164.
[2] HE K,ZHANG X,REN S,et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2016:770-778.
[3] SIMONYAN K,ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL].[2020-04-21]. https://arxiv.org/pdf/1409.1556.pdf.
[4] WEI Y,XIA W,LIN M,et al. HCP:a flexible CNN framework for multi-label image classification[J]. IEEE Transactions on Software Engineering,2016,38(9):1901-1907.
[5] SOCHER R, KARPATHY A, LE Q V, et al. Grounded compositional semantics for finding and describing images with sentences[J]. Transactions of the Association for Computational Linguistics,2014,2:207-218.
[6] CHEN X, ZITNICK C L. Mind's eye:a recurrent visual representation for image caption generation[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE,2015:2422-2431.
[7] GAO J,WANG S,WANG S,et al. Self-critical n-step training for image captioning[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE,2019:6293-6301.
[8] LU J,XIONG C,PARIKH D,et al. Knowing when to look:adaptive attention via a visual sentinel for image captioning[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE,2017:3242-3250.
[9] CHEN L,ZHANG H,XIAO J,et al. SCA-CNN:spatial and channel-wise attention in convolutional networks for image captioning[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2017:6298-6306.
[10] 陈龙杰, 张钰, 张玉梅, 等. 基于多注意力多尺度特征融合的图像描述生成算法[J]. 计算机应用,2019,39(2):354-359. (CHEN L J,ZHANG Y,ZHANG Y M,et al. Image caption algorithm based on multi-attention and multi-scale feature fusion[J]. Journal of Computer Applications,2019,39(2):354-359.)
[11] XU K,BA J L,KIROS R,et al. Show,attend and tell:neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning. New York:JMLR. org,2015:2048-2057.
[12] HOCHREITER S,SCHMIDHUBER J. Long short-term memory[J]. Neural Computation,1997,9(8):1735-1780.
[13] 黄友文, 游亚东, 赵朋. 融合卷积注意力机制的图像描述生成模型[J]. 计算机应用,2020,40(1):23-27.(HUANG Y W, YOU Y D, ZHAO P. Image caption generation model with convolutional attention mechanism[J]. Journal of Computer Applications,2020,40(1):23-27.)
[14] 杨丽, 吴雨茜, 王俊丽, 等. 循环神经网络研究综述[J]. 计算机应用,2018,38(S2):1-6,26.(YANG L,WU Y X,WANG J L, et al. Research on recurrent neural network[J]. Journal of Computer Applications,2018,38(S2):1-6,26.)
[15] PAPINENI K,ROUKOS S,WARD T,et al. BLEU:a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA:Association for Computational Linguistics,2002:311-318.
[16] VEDANTAM R,ZITNICK C L,PARIKH D. CIDEr:consensusbased image description evaluation[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE,2015:4566-4575.
[17] LIN C Y. ROUGE:a package for automatic evaluation of summaries[C]//Proceedings of the 2004 ACL Workshop on Text Summarization Branches Out. Stroudsburg,PA:Association for Computational Linguistics,2004:74-81.
[18] BANERJEE S,LAVIE A. METEOR:an automatic metric for mt evaluation with improved correlation with human judgments[C]//Proceedings of the 2005 ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg,PA:Association for Computational Linguistics,2005:65-72.
[19] HODOSH M,YOUNG P,HOCKENMAIER J. Framing image description as a ranking task:data,models and evaluation metrics[J]. Journal of Artificial Intelligence Research, 2013, 47:853-899.
[20] YOUNG P,LAI A,HODOSH M,et al. From image descriptions to visual denotations:new similarity metrics for semantic inference over event descriptions[J]. Transactions of the Association for Computational Linguistics,2014,2:67-78.
[21] KARPATHY A,LI F F. Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE,2015:3128-3137.
[22] 陶云松, 张丽红. 基于双向注意力机制图像描述方法研究[J]. 测试技术学报,2019,33(4):347-350,364.(TAO Y S, ZHANG L H. Research on image description method based on bidirectional attentional mechanism[J]. Journal of Test and Measurement Technology,2019,33(4):347-350,364.)