Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (6): 1640-1646. DOI: 10.11772/j.issn.1001-9081.2020091439

Special topic: Artificial Intelligence

Image captioning algorithm based on multi-feature extraction

ZHAO Xiaohu1,2, LI Xiao1,2

  1. National and Local Joint Engineering Laboratory of Internet Applied Technology on Mines (China University of Mining and Technology), Xuzhou Jiangsu 221008, China;
  2. School of Information and Control Engineering, China University of Mining and Technology, Xuzhou Jiangsu 221008, China
  • Received: 2020-09-16 Revised: 2020-12-05 Online: 2021-06-10 Published: 2020-12-18
  • Corresponding author: LI Xiao
  • About the authors: ZHAO Xiaohu (born 1976), male, from Xuzhou, Jiangsu, professor, Ph.D. His research interests include mine Internet of Things, mine communication, monitoring and control, computer networks, and intelligent computing. LI Xiao (born 1994), female, from Suzhou, Anhui, M.S. candidate. Her research interests include computer vision processing and image captioning.
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2017YFC0804400) and the Key Research and Development Science and Technology Project of Xuzhou (KC19112).

Abstract: Existing image captioning methods extract image feature information incompletely, and the Recurrent Neural Network (RNN) they rely on suffers from vanishing gradients. To address these problems, an image captioning algorithm based on multi-feature extraction was proposed. The constructed model consists of three parts: a Convolutional Neural Network (CNN) for image feature extraction, an ATTribute extraction model (ATT) for image attribute extraction, and a Bidirectional Long Short-Term Memory (Bi-LSTM) network for word prediction. The model enhances the image representation by extracting image attribute information, so that the objects in the image are described accurately, and it uses the Bi-LSTM to capture bidirectional semantic dependencies, which enables long-term visual-language interaction learning. Firstly, CNN and ATT were used to extract the global image features and the image attribute features respectively. Then, both kinds of features were input into the Bi-LSTM to generate sentences that reflect the image content. Finally, the effectiveness of the proposed algorithm was validated on the Microsoft COCO Caption, Flickr8k, and Flickr30k datasets. Experimental results show that, compared with the multimodal Recurrent Neural Network (m-RNN) method, the proposed algorithm improves description performance by 6.8 to 11.6 percentage points. The proposed algorithm can effectively improve the model's semantic description of images.

Key words: image captioning, image attribute, Bidirectional Long Short-Term Memory (Bi-LSTM) network, Convolutional Neural Network (CNN), Recurrent Neural Network (RNN)
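
The data flow described in the abstract (global features and attribute features fed together into a bidirectional LSTM) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how the two feature types are fused, so the sketch assumes simple concatenation with each word embedding. All names (`LSTMCell`, `caption_states`), all dimensions, and the random stand-in features are hypothetical. Biases, training, and the final word-prediction layer are omitted.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class LSTMCell:
    """Standard LSTM cell (input, forget, output gates plus candidate), no biases."""
    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # One weight matrix per gate, applied to the concatenation [x; h_prev].
        self.W = {g: rand_matrix(hidden_size, input_size + hidden_size, rng)
                  for g in "ifog"}

    def step(self, x, h, c):
        z = x + h  # list concatenation: [x; h_prev]
        i = [sigmoid(a) for a in matvec(self.W["i"], z)]
        f = [sigmoid(a) for a in matvec(self.W["f"], z)]
        o = [sigmoid(a) for a in matvec(self.W["o"], z)]
        g = [math.tanh(a) for a in matvec(self.W["g"], z)]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new

def run(cell, inputs):
    """Run one LSTM over a sequence and return the hidden state at each step."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    states = []
    for x in inputs:
        h, c = cell.step(x, h, c)
        states.append(h)
    return states

def bi_lstm(fwd, bwd, inputs):
    """Concatenate left-to-right and right-to-left hidden states per position."""
    forward = run(fwd, inputs)
    backward = run(bwd, inputs[::-1])[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]

def caption_states(global_feat, attr_feat, word_embeddings, fwd, bwd):
    """Assumed fusion: concatenate both image vectors with every word embedding."""
    image_vec = global_feat + attr_feat
    inputs = [image_vec + w for w in word_embeddings]
    return bi_lstm(fwd, bwd, inputs)

rng = random.Random(0)
global_feat = [rng.random() for _ in range(4)]  # stand-in for the CNN output
attr_feat = [1.0, 0.0, 1.0]                     # stand-in for ATT attribute scores
words = [[rng.random() for _ in range(5)] for _ in range(6)]  # 6 word embeddings
input_size = 4 + 3 + 5
fwd = LSTMCell(input_size, 8, rng)
bwd = LSTMCell(input_size, 8, rng)
states = caption_states(global_feat, attr_feat, words, fwd, bwd)
print(len(states), len(states[0]))  # 6 16
```

Each position's state combines context from the words before it (forward half) and the words after it (backward half), which is the bidirectional dependency the abstract refers to. In a full model, a projection with softmax over the vocabulary would follow each state to predict a word.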

CLC number: