Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (1): 23-27. DOI: 10.11772/j.issn.1001-9081.2019050943

• Artificial Intelligence •

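Image caption generation model with convolutional attention mechanism

HUANG Youwen, YOU Yadong, ZHAO Peng

  1. School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
  • Received: 2019-06-04 Revised: 2019-09-25 Online: 2020-01-10 Published: 2020-01-17
  • Corresponding author: YOU Yadong
  • About the authors: HUANG Youwen (born 1982), male, from Ganzhou, Jiangxi; associate professor, Ph.D.; research interests: deep learning and machine vision. YOU Yadong (born 1996), male, from Fuzhou, Jiangxi; M.S. candidate; research interests: deep learning and image captioning. ZHAO Peng (born 1992), male, from Zhoukou, Henan; M.S. candidate; research interest: deep learning.
  • Supported by:
    Science and Technology Project of the Department of Education of Jiangxi Province (GJJ180443); School-level Key Project of Jiangxi University of Science and Technology (NSFJ2014-K18).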

Abstract: An image caption model needs to extract features from an image and then express them in sentences using Natural Language Processing (NLP) techniques. Existing image caption models based on Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) suffer from low precision when extracting key information from the image and from slow training. To address these problems, an image caption generation model based on a convolutional attention mechanism and a Long Short-Term Memory (LSTM) network was proposed. Inception-ResNet-V2 was used as the feature extraction network, and fully convolutional operations were introduced into the attention mechanism to replace the traditional fully connected operations, reducing the number of model parameters. The image features and the text features were fused and fed into the LSTM unit for training, so as to generate semantic information describing the image content. The model was trained on the MSCOCO dataset and validated with a variety of evaluation metrics (BLEU-1, BLEU-4, METEOR, CIDEr, etc.). The experimental results show that the proposed model describes image content accurately and outperforms methods based on the traditional attention mechanism on multiple evaluation metrics.
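The abstract does not spell out the attention layer, so the following is only a minimal sketch of one common way to compute soft attention scores with 1×1 convolutions, that is, with the same small weight matrices applied at every spatial location of the CNN feature map. The function names (`conv1x1`, `conv_attention`), the tanh scoring form, and the toy sizes are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def conv1x1(feature_map, weights):
    """1x1 convolution: apply the same weight matrix at every location.

    feature_map: list of locations, each a list of C_in channel values.
    weights: C_out rows, each with C_in values.
    """
    return [[sum(w * x for w, x in zip(row, pixel)) for row in weights]
            for pixel in feature_map]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def conv_attention(feature_map, hidden, w_feat, w_hid, w_score):
    """Soft attention with 1x1-convolution scoring (assumed form).

    Returns (alpha, context): one weight per location, and the
    attention-weighted average of the feature vectors.
    """
    proj_feat = conv1x1(feature_map, w_feat)   # L x A
    proj_hid = conv1x1([hidden], w_hid)[0]     # A, shared by all locations
    mixed = [[math.tanh(f + h) for f, h in zip(loc, proj_hid)]
             for loc in proj_feat]
    scores = [row[0] for row in conv1x1(mixed, w_score)]  # w_score is 1 x A
    alpha = softmax(scores)
    channels = len(feature_map[0])
    context = [sum(a * loc[c] for a, loc in zip(alpha, feature_map))
               for c in range(channels)]
    return alpha, context

def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights for the demo."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumptions): 4 locations, 3 channels, 5-dim LSTM state, 2 attention units.
rng = random.Random(0)
L, C, D, A = 4, 3, 5, 2
feature_map = random_matrix(L, C, rng)
hidden = [rng.uniform(-1.0, 1.0) for _ in range(D)]
alpha, context = conv_attention(
    feature_map, hidden,
    random_matrix(A, C, rng), random_matrix(A, D, rng), random_matrix(1, A, rng))
```

In this sketch the weight count is `A*C + A*D + A`, which does not grow with the number of spatial locations `L`; the parameter savings reported in the paper depend on its actual layer design, which the abstract does not give.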

Key words: image caption, Convolutional Neural Network (CNN), Natural Language Processing (NLP), Long Short-Term Memory (LSTM) neural network, convolutional attention mechanism
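For readers unfamiliar with the BLEU-1 and BLEU-4 metrics named in the abstract, the sketch below computes a sentence-level BLEU-n score from clipped n-gram precision and a brevity penalty. It is an unsmoothed, simplified illustration with hypothetical function names and example sentences; it is not the evaluation toolkit used in the paper, and published scores are normally computed at corpus level.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def bleu(candidate, references, max_n):
    """Sentence-level BLEU with uniform weights over 1..max_n and no smoothing."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    c = len(candidate)
    # Reference length closest to the candidate length (shorter wins ties).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Made-up example sentences, for illustration only.
candidate = "a dog runs on the grass".split()
references = ["a dog is running on the grass".split()]
bleu1 = bleu(candidate, references, 1)  # 5 of 6 unigrams match, then brevity penalty
bleu4 = bleu(candidate, references, 4)  # no 4-gram matches, so 0.0 without smoothing
```

The example shows why BLEU-4 is much stricter than BLEU-1: a caption that shares most words with the reference can still score zero when no four-word sequence matches.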
