Journal of Computer Applications

• Artificial intelligence and simulation •

Image caption model based on attention feature adaptive recalibration

WEI Renyu1, MENG Zuqiang2

  1. School of Computer, Electronics and Information, Guangxi University, Nanning 530004
  • Received: 2019-12-25 Revised: 2020-03-01 Online: 2020-03-01 Published: 2020-05-13
  • Corresponding author: MENG Zuqiang

Abstract: Image caption models built on deep learning frameworks select image features inaccurately and use them insufficiently, which lowers the overall quality of the generated captions. To address this, an image caption model based on attention feature adaptive recalibration is proposed. First, a convolutional neural network extracts image features and an attention mechanism is integrated, so that the model dynamically focuses on different image regions as it outputs words in order, yielding attention features that carry location information. Then, a channel activation layer fully captures the dependencies between channels and adaptively recalibrates the attention features, which improves their representational power and, in turn, the quality of the captions generated by a Long Short-Term Memory (LSTM) network. Comparison experiments were conducted on three standard datasets: MS COCO, Flickr8K and Flickr30K. On MS COCO, the proposed model reaches BLEU_1, BLEU_2, BLEU_3, BLEU_4, Meteor and CIDEr scores of 69.4%, 52.3%, 38.6%, 28.5%, 23.3% and 83.6%, respectively, outperforming traditional neural network image caption models and generating more accurate image captions.

Key words: image caption, deep learning, attention mechanism, multimodal, natural language processing (NLP)
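
The two feature steps summarized in the abstract (attention over image regions, then channel-wise recalibration of the attention feature) can be illustrated with a minimal NumPy sketch. The abstract does not give the exact form of the channel activation layer, so the gate below assumes a squeeze-and-excitation-style bottleneck followed by a sigmoid; the additive attention scoring, all function names, weight shapes and toy sizes are likewise assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def soft_attention(features, hidden, W_f, W_h, v):
    """Additive soft attention over image regions (assumed scoring form).

    features: (L, C) CNN features for L regions with C channels
    hidden:   (H,)   previous LSTM hidden state
    Returns the (C,) attention feature and the (L,) region weights.
    """
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v  # (L,)
    alpha = softmax(scores)                              # weights sum to 1
    return alpha @ features, alpha


def channel_recalibrate(context, W1, W2):
    """Assumed SE-style gate: bottleneck MLP, sigmoid, channel-wise rescale."""
    z = np.maximum(context @ W1, 0.0)        # ReLU bottleneck, (C // r,)
    gate = 1.0 / (1.0 + np.exp(-(z @ W2)))   # per-channel gate in (0, 1)
    return context * gate, gate


# Toy sizes and random weights, purely illustrative (not from the paper).
rng = np.random.default_rng(0)
L, C, H, A, r = 49, 64, 32, 16, 4
features = rng.standard_normal((L, C))
hidden = rng.standard_normal(H)
W_f, W_h, v = (rng.standard_normal(s) * 0.1 for s in [(C, A), (H, A), (A,)])
W1 = rng.standard_normal((C, C // r)) * 0.1
W2 = rng.standard_normal((C // r, C)) * 0.1

context, alpha = soft_attention(features, hidden, W_f, W_h, v)
recalibrated, gate = channel_recalibrate(context, W1, W2)
print(alpha.shape, context.shape, recalibrated.shape)
```

In a full captioning model the recalibrated feature would be fed to the LSTM at each decoding step, and the weights would be learned jointly rather than drawn at random as they are here.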

CLC number: