Journal of Computer Applications, 2019, Vol. 39, Issue (2): 354-359. DOI: 10.11772/j.issn.1001-9081.2018071464


Image caption generation algorithm based on multi-attention and multi-scale feature fusion

CHEN Longjie1,2,3,4, ZHANG Yu1,2,3,4, ZHANG Yumei1,2,3,4, WU Xiaojun1,2,3,4   

  1. Key Laboratory of Modern Teaching Technology, Ministry of Education (Shaanxi Normal University), Xi'an Shaanxi 710062, China;
  2. Engineering Laboratory of Teaching Information Technology of Shaanxi Province (Shaanxi Normal University), Xi'an Shaanxi 710119, China;
  3. Culture, Education and Intelligent Communication Engineering Technology Research Center (Shaanxi Normal University), Xi'an Shaanxi 710119, China;
  4. School of Computer Science, Shaanxi Normal University, Xi'an Shaanxi 710119, China
  • Received: 2018-07-17; Revised: 2018-09-12; Online: 2019-02-10; Published: 2019-02-15
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (11772178, 61741208, 11502133), the Fundamental Research Funds for the Central Universities (GK201801004, GK201803089, GK201703082), the Natural Science Foundation of Shaanxi Province (2017JQ6074), the National Key Research and Development Program of China (2017YFB1402102), the Natural Science Basic Research Plan of Shaanxi Province (2017JM6103, 2017JM6060), and the Teaching Reform and Research Project of Shaanxi Normal University (17JG33).

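  • Corresponding author: WU Xiaojun
  • About the authors: CHEN Longjie (born 1990), male, from Pingdingshan, Henan, is a master's student whose research interests include natural language processing and computer vision. ZHANG Yu (born 1982), male, from Hanzhong, Shaanxi, is a lecturer with a Ph.D. whose research interests include computer vision and pattern recognition. ZHANG Yumei (born 1977), female, from Suide, Shaanxi, is an associate professor with a Ph.D. and a CCF member whose research interests include signal processing and intelligent systems. WU Xiaojun (born 1970), male, from Baoji, Shaanxi, is a professor with a Ph.D. whose research interests include complex networks and pattern recognition.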

Abstract: To address the low quality of detail description, the insufficient use of image features, and the single-level recurrent neural network structure in image caption generation, an image caption generation algorithm based on multi-attention and multi-scale feature fusion was proposed. A pre-trained object detection network was used to extract image features from different layers of the convolutional neural network, and the features of each layer were fed into a separate attention structure. The attention structures, each operating on features of a different level, were connected in sequence to the layers of a multi-level recurrent neural network, forming a multi-level image caption generation model. Residual connections were added to the recurrent network to improve its performance and to avoid the degradation caused by deepening the network. On the MSCOCO test set, the proposed algorithm achieved a BLEU-1 score of 0.804 and a CIDEr score of 1.167, clearly outperforming the top-down image caption generation algorithm based on a single attention structure. Manual inspection and comparison also indicate that the captions generated by the proposed algorithm capture image details better.
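
The structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scaled dot-product scoring in `attend`, the toy `recurrent_layer` (a tanh update standing in for an LSTM cell), the use of each layer's previous hidden state as the attention query, and all names and dimensions are assumptions made only for illustration. The sketch shows one decoding step in which layer k attends over feature level k and a residual connection carries the layer input forward.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, regions):
    """Soft attention (assumed scaled dot-product scoring).

    Returns the attention-weighted sum of the region vectors and the weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * r for q, r in zip(query, region)) / scale
              for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    context = [sum(w * region[d] for w, region in zip(weights, regions))
               for d in range(dim)]
    return context, weights


def recurrent_layer(x, h_prev, context, w):
    """Toy recurrent update, a hypothetical stand-in for an LSTM cell."""
    return [math.tanh(w[0] * xi + w[1] * hi + w[2] * ci)
            for xi, hi, ci in zip(x, h_prev, context)]


def decoder_step(word_vec, hidden, feature_levels, layer_weights):
    """One time step through the stacked layers.

    Layer k attends over feature level k, using its previous hidden state
    as the query (an assumption of this sketch).
    """
    x = word_vec
    new_hidden, all_weights = [], []
    for h_prev, regions, w in zip(hidden, feature_levels, layer_weights):
        context, weights = attend(h_prev, regions)
        h = recurrent_layer(x, h_prev, context, w)
        new_hidden.append(h)
        all_weights.append(weights)
        # Residual connection: add the layer input to the layer output.
        x = [xi + hi for xi, hi in zip(x, h)]
    return x, new_hidden, all_weights


random.seed(0)
DIM = 4
# Two hypothetical feature scales: 6 fine-grained and 3 coarse region vectors.
feature_levels = [[[random.uniform(-1, 1) for _ in range(DIM)]
                   for _ in range(n)] for n in (6, 3)]
hidden = [[0.0] * DIM for _ in feature_levels]
layer_weights = [(0.5, 0.3, 0.2), (0.5, 0.3, 0.2)]
word_vec = [0.1, -0.2, 0.3, 0.0]

output, hidden, attn = decoder_step(word_vec, hidden, feature_levels,
                                    layer_weights)
```

Because the residual path adds each layer's input to its output, `output` here equals the word vector plus the hidden states of both layers, so deeper stacks keep a direct path from the input to the top of the network.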

Key words: Long Short-Term Memory (LSTM) network, image caption, multi-attention mechanism, multi-scale feature fusion, deep neural network
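
For readers unfamiliar with the BLEU-1 metric reported in the abstract, the following is a minimal sentence-level sketch: clipped unigram precision multiplied by the brevity penalty. The function name `bleu1`, the whitespace tokenization, and the example captions are assumptions. Published scores such as those above are normally computed at corpus level with a standard evaluation toolkit, so this sketch is not expected to reproduce them.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Sentence-level BLEU-1 for one candidate and a non-empty reference list."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0

    # Clip each candidate word count by its maximum count in any reference.
    max_ref_counts = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in Counter(cand).items())
    precision = clipped / len(cand)

    # Brevity penalty uses the reference length closest to the candidate
    # length (ties resolved toward the shorter reference).
    ref_len = min((len(r) for r in refs),
                  key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision


score = bleu1("a man riding a horse",
              ["a man is riding a horse", "a person rides a brown horse"])
repeated = bleu1("the the the", ["the cat"])
```

In the first example every candidate word is matched, so the score is reduced only by the brevity penalty; in the second, clipping limits the repeated word to a single match.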

