[1] KULKARNI G, PREMRAJ V, DHAR S, et al. BabyTalk: understanding and generating simple image descriptions[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12): 2891-2903.
[2] MITCHELL M, DODGE J, GOYAL A, et al. Midge: generating image descriptions from computer vision detections[C]//Proceedings of the 2012 13th Conference of the European Chapter of the Association for Computational Linguistics. Stroudsburg: ACL, 2012: 747-756.
[3] ELLIOTT D, KELLER F. Image description using visual dependency representations[C]//Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2013: 1292-1302.
[4] FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: generating sentences from images[C]//Proceedings of the 2010 European Conference on Computer Vision, LNCS 6314. Berlin: Springer, 2010: 15-29.
[5] SOCHER R, KARPATHY A, LE Q V, et al. Grounded compositional semantics for finding and describing images with sentences[J]. Transactions of the Association for Computational Linguistics, 2014, 2: 207-218.
[6] KUZNETSOVA P, ORDONEZ V, BERG T L, et al. TreeTalk: composition and compression of trees for image descriptions[J]. Transactions of the Association for Computational Linguistics, 2014, 2: 351-362.
[7] KUZNETSOVA P, ORDONEZ V, BERG A, et al. Generalizing image captions for image-text parallel corpus[C]//Proceedings of the 2013 51st Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2013: 790-796.
[8] MASON R, CHARNIAK E. Nonparametric method for data-driven image captioning[C]//Proceedings of the 2014 52nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2014: 592-598.
[9] MAO J, XU W, YANG Y, et al. Deep captioning with multimodal Recurrent Neural Networks (m-RNN)[EB/OL]. [2020-11-17]. https://arxiv.org/pdf/1412.6632.pdf.
[10] KIROS R, SALAKHUTDINOV R, ZEMEL R S. Unifying visual-semantic embeddings with multimodal neural language models[EB/OL]. [2020-11-17]. https://arxiv.org/pdf/1411.2539.pdf.
[11] SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 1-9.
[12] JIA X, GAVVES E, FERNANDO B, et al. Guiding the long-short term memory model for image caption generation[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 2407-2415.
[13] XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C]//Proceedings of the 2015 32nd International Conference on Machine Learning. New York: JMLR.org, 2015: 2048-2057.
[14] LI L, TANG S, DENG L, et al. Image caption with global-local attention[C]//Proceedings of the 2017 31st AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2017: 4133-4139.
[15] LUO M, CHANG X, LI Z, et al. Simple to complex cross-modal learning to rank[J]. Computer Vision and Image Understanding, 2017, 163: 67-77.
[16] HE X, SHI B, BAI X, et al. Image caption generation with part of speech guidance[J]. Pattern Recognition Letters, 2019, 119: 229-237.
[17] YANG J, SUN Y, LIANG J, et al. Image captioning by incorporating affective concepts learned from both visual and textual components[J]. Neurocomputing, 2019, 328: 56-68.
[18] ZHAO D, CHANG Z, GUO S. A multimodal fusion approach for image captioning[J]. Neurocomputing, 2019, 329: 476-485.
[19] LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot MultiBox detector[C]//Proceedings of the 2016 European Conference on Computer Vision, LNCS 9905. Cham: Springer, 2016: 21-37.
[20] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
[21] WANG C, YANG H, MEINEL C. Image captioning with deep bidirectional LSTMs and multi-task learning[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2018, 14(2s): Article No. 40.
[22] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3156-3164.
[23] DONAHUE J, HENDRICKS L A, ROHRBACH M, et al. Long-term recurrent convolutional networks for visual recognition and description[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 677-691.
[24] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3128-3137.
[25] CHENG Y, HUANG F, ZHOU L, et al. A hierarchical multimodal attention-based neural network for image captioning[C]//Proceedings of the 2017 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2017: 889-892.
[26] QU S, XI Y, DING S. Visual attention based on long-short term memory model for image caption generation[C]//Proceedings of the 2017 29th Chinese Control and Decision Conference. Piscataway: IEEE, 2017: 4789-4794.
[27] 王媛华. 基于多融合模型的图像语义描述研究[J]. 河南科技, 2019(14): 34-36. (WANG Y H. Image caption based on multi-fusion model[J]. Henan Science and Technology, 2019(14): 34-36.)