Image caption generation model with convolutional attention mechanism

doi:10.11772/j.issn.1001-9081.2019050943

Journal of Computer Applications, 2020, Vol. 40, Issue (1): 23-27.

• Artificial intelligence •

HUANG Youwen, YOU Yadong, ZHAO Peng

School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China

Received: 2019-06-04; Revised: 2019-09-25; Online: 2020-01-10; Published: 2020-01-17
Supported by:
This work is partially supported by the Science and Technology Project of the Department of Education of Jiangxi Province (GJJ180443) and the School-level Key Project of Jiangxi University of Science and Technology (NSFJ2014-K18).

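Corresponding author: YOU Yadong
About the authors: HUANG Youwen (born 1982), male, from Ganzhou, Jiangxi, associate professor, Ph.D.; main research interests: deep learning and machine vision. YOU Yadong (born 1996), male, from Fuzhou, Jiangxi, M.S. candidate; main research interests: deep learning and image captioning. ZHAO Peng (born 1992), male, from Zhoukou, Henan, M.S. candidate; main research interest: deep learning.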

Abstract: An image caption model must extract features from an image and then express those features as a sentence using Natural Language Processing (NLP) techniques. Existing image caption models based on a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) suffer from low precision and slow training when extracting key information from the image. To address these problems, an image caption generation model based on a convolutional attention mechanism and a Long Short-Term Memory (LSTM) network is proposed. Inception-ResNet-V2 is used as the feature extraction network, and fully convolutional operations replace the traditional fully connected operations in the attention mechanism, which reduces the number of model parameters. The image features and text features are fused and fed to the LSTM unit for training, so that the model generates semantic information describing the image content. The model was trained on the MSCOCO dataset and validated with several evaluation metrics (BLEU-1, BLEU-4, METEOR, CIDEr, etc.). The experimental results show that the proposed model describes image content accurately and outperforms the method based on the traditional attention mechanism on multiple evaluation metrics.

Key words: image caption, Convolutional Neural Network (CNN), Natural Language Processing (NLP), Long Short-Term Memory (LSTM) neural network, convolutional attention mechanism
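
The abstract states that replacing fully connected operations with convolutions in the attention mechanism reduces the number of parameters. The sketch below illustrates that general idea only; it is not the authors' architecture, which the abstract does not specify. It assumes an 8 x 8 x 1536 feature map (the usual final convolutional output of Inception-ResNet-V2 for a 299 x 299 input), and the names fc_score_params, conv_score_params and conv1x1_attention are hypothetical. It compares the parameter count of a dense layer that scores all regions from the flattened map with that of a convolution whose kernel is shared across regions, then runs a small soft-attention step with a shared 1 x 1 kernel.

```python
import math


def fc_score_params(height, width, channels):
    """Weights plus biases of a dense layer that maps the flattened
    height*width*channels feature map to one score per region."""
    n_in = height * width * channels
    n_out = height * width
    return n_in * n_out + n_out


def conv_score_params(channels, kernel_size=1):
    """Weights plus bias of a kernel_size x kernel_size convolution with
    `channels` input channels and a single output (score) channel."""
    return kernel_size * kernel_size * channels + 1


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conv1x1_attention(features, query, kernel, bias=0.0):
    """Soft attention with a 1x1 convolution as the scoring function.

    features: list of region vectors, each of length C (the flattened map).
    query:    length-C vector standing in for a projected decoder state
              (an assumption made for this illustration).
    kernel:   length-C weights shared by every region, i.e. a 1x1 conv.
    Returns (attention weights, context vector).
    """
    scores = [
        sum(k * math.tanh(f + q) for f, q, k in zip(region, query, kernel)) + bias
        for region in features
    ]
    alphas = softmax(scores)
    context = [
        sum(a * region[ch] for a, region in zip(alphas, features))
        for ch in range(len(kernel))
    ]
    return alphas, context


if __name__ == "__main__":
    # Assumed feature map size: 8 x 8 regions, 1536 channels.
    print(fc_score_params(8, 8, 1536))   # 6291520
    print(conv_score_params(1536, 1))    # 1537
    print(conv_score_params(1536, 3))    # 13825

    regions = [[2.0, 0.0], [0.0, 0.0]]
    weights, context = conv1x1_attention(regions, [0.0, 0.0], [1.0, 1.0])
    print(weights, context)
```

Under these assumptions the shared kernel needs a few thousand parameters where the dense scoring layer needs several million, because the convolution reuses the same weights at every spatial location instead of learning a separate weight for each input-output pair.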

CLC Number: TP391

HUANG Youwen, YOU Yadong, ZHAO Peng. Image caption generation model with convolutional attention mechanism[J]. Journal of Computer Applications, 2020, 40(1): 23-27.

黄友文, 游亚东, 赵朋. 融合卷积注意力机制的图像描述生成模型[J]. 计算机应用, 2020, 40(1): 23-27.
