Image captioning algorithm based on multi-feature extraction

Journal of Computer Applications, 2021, Vol. 41, Issue (6): 1640-1646. DOI: 10.11772/j.issn.1001-9081.2020091439

Special topic: Artificial Intelligence

ZHAO Xiaohu^1,2, LI Xiao^1,2

1. National and Local Joint Engineering Laboratory of Internet Applied Technology on Mines (China University of Mining and Technology), Xuzhou, Jiangsu 221008, China;
2. School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, Jiangsu 221008, China

Received: 2020-09-16 Revised: 2020-12-05 Online: 2020-12-18 Published: 2021-06-10
Corresponding author: LI Xiao
About the authors: ZHAO Xiaohu, born in 1976 in Xuzhou, Jiangsu, Ph.D., professor. His research interests include the mine Internet of Things, mine communication, monitoring and control, computer networks, and intelligent computing. LI Xiao, born in 1994 in Suzhou, Anhui, M.S. candidate. Her research interests include computer vision processing and image captioning.
Supported by:
This work is partially supported by the National Key Research and Development Program of China (2017YFC0804400) and the Key Research and Development Science and Technology Project of Xuzhou (KC19112).

Abstract

Abstract: Existing image captioning methods extract image feature information incompletely, and the Recurrent Neural Network (RNN) they rely on suffers from vanishing gradients. To address these problems, an image captioning algorithm based on multi-feature extraction was proposed. The constructed model consists of three parts: a Convolutional Neural Network (CNN) for image feature extraction, an ATTribute extraction model (ATT) for image attribute extraction, and a Bidirectional Long Short-Term Memory (Bi-LSTM) network for word prediction. The model enhances the image representation with the extracted attribute information so that the objects in the image are described accurately, and it uses the Bi-LSTM to capture bidirectional semantic dependencies for long-term visual-language interaction learning. Firstly, the CNN and ATT extract the global image features and the image attribute features respectively. Then, both kinds of features are input into the Bi-LSTM to generate sentences that reflect the image content. Finally, the effectiveness of the proposed algorithm was validated on the Microsoft COCO Caption, Flickr8k, and Flickr30k datasets. Experimental results show that, compared with the multimodal Recurrent Neural Network (m-RNN) method, the proposed algorithm improves description performance by 6.8 to 11.6 percentage points, which indicates that it effectively improves the semantic description of images.

Key words: image captioning, image attribute, Bidirectional Long Short-Term Memory (Bi-LSTM) network, Convolutional Neural Network (CNN), Recurrent Neural Network (RNN)
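
The data flow described in the abstract (global features and attribute features feeding a Bi-LSTM that predicts words) can be sketched roughly as follows. This is a minimal, untrained illustration in plain Python and not the authors' implementation. The fusion step (concatenating the two feature vectors and projecting them to the initial hidden state), the random weights, and all names such as `LSTMCell` and `bilstm_word_scores` are assumptions made for illustration. The CNN and ATT outputs are replaced by hand-written vectors, and the sketch scores an already given word sequence rather than generating a caption.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class LSTMCell:
    """Standard LSTM cell: input (i), forget (f), output (o) gates and candidate (g)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # One weight matrix per gate, applied to the concatenation [x; h].
        self.W = {g: rand_matrix(hidden_size, input_size + hidden_size, rng)
                  for g in "ifog"}

    def step(self, x, h, c):
        xh = x + h  # list concatenation
        i = [sigmoid(z) for z in matvec(self.W["i"], xh)]
        f = [sigmoid(z) for z in matvec(self.W["f"], xh)]
        o = [sigmoid(z) for z in matvec(self.W["o"], xh)]
        g = [math.tanh(z) for z in matvec(self.W["g"], xh)]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new

def run_lstm(cell, inputs, h0):
    """Run the cell over a sequence and return the hidden state at every step."""
    h, c = h0, [0.0] * cell.hidden_size
    outputs = []
    for x in inputs:
        h, c = cell.step(x, h, c)
        outputs.append(h)
    return outputs

def bilstm_word_scores(global_feat, attr_feat, word_embs,
                       hidden_size, vocab_size, seed=0):
    """Return one vector of vocabulary scores per word position."""
    rng = random.Random(seed)
    # Assumed fusion: concatenate both features, project to the initial state.
    fused = global_feat + attr_feat
    W_init = rand_matrix(hidden_size, len(fused), rng)
    h0 = [math.tanh(z) for z in matvec(W_init, fused)]
    emb_size = len(word_embs[0])
    fwd = LSTMCell(emb_size, hidden_size, rng)
    bwd = LSTMCell(emb_size, hidden_size, rng)
    h_fwd = run_lstm(fwd, word_embs, h0)
    # The backward pass reads the sentence reversed; re-reverse to align steps.
    h_bwd = run_lstm(bwd, word_embs[::-1], h0)[::-1]
    W_out = rand_matrix(vocab_size, 2 * hidden_size, rng)
    return [matvec(W_out, hf + hb) for hf, hb in zip(h_fwd, h_bwd)]

scores = bilstm_word_scores(
    global_feat=[0.2, 0.5, 0.1, 0.9],   # stand-in for CNN global features
    attr_feat=[0.8, 0.1, 0.3],          # stand-in for ATT attribute scores
    word_embs=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    hidden_size=5, vocab_size=6)
print(len(scores), len(scores[0]))  # prints "3 6"
```

Each position's scores depend on both the words before it (forward pass) and the words after it (backward pass), which is the bidirectional dependency the abstract refers to. A real system would apply a softmax over each score vector and train all weights on caption data.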

CLC number: TP391.41

ZHAO Xiaohu, LI Xiao. Image captioning algorithm based on multi-feature extraction[J]. Journal of Computer Applications, 2021, 41(6): 1640-1646.

