Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (4): 1179-1184. DOI: 10.11772/j.issn.1001-9081.2017.04.1179

• Computer Vision and Virtual Reality •


Deep natural language description method for video based on multi-feature fusion

LIANG Rui1, ZHU Qingxin1, LIAO Shujiao1, NIU Xinzheng2   

  1. School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu Sichuan 610054, China;
    2. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu Sichuan 610054, China
  • Received: 2016-09-14  Revised: 2016-12-25  Online: 2017-04-10  Published: 2017-04-19
  • Corresponding author: ZHU Qingxin
  • About the authors: LIANG Rui (1985-), male, born in Suining, Sichuan, Ph.D. candidate, CCF member; his research interests include computer vision and video semantic analysis. ZHU Qingxin (1954-), male, born in Chengdu, Sichuan, Ph.D., professor, CCF member; his research interests include software engineering, graphics and vision, computational operations research and bioinformatics. LIAO Shujiao (1981-), female, born in Zhangzhou, Fujian, Ph.D. candidate, CCF member; her research interests include machine learning and granular computing. NIU Xinzheng (1978-), male, born in Chengdu, Sichuan, Ph.D., associate professor; his research interests include machine learning, big data and mobile computing.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61300192) and the Fundamental Research Funds for the Central Universities (ZYGX2014J052).


Abstract: Concerning the low accuracy of automatic video labelling and description by computers, a deep natural language description method for video based on multi-feature fusion was proposed. The spatial features, motion features and video features of a video frame sequence were extracted and fused to train a natural language description model based on Long Short-Term Memory (LSTM). Several natural language description models were trained on different combinations of early-fused features, and late fusion was then performed at test time: one model was selected to generate the possible outputs for the current input, the probabilities of these outputs were recomputed with the other models, the weighted sum of these probabilities was computed, and the output with the highest probability was taken as the result. The feature fusion methods investigated include early fusion, namely feature concatenation and weighted summing of different features after alignment, and late fusion, namely weighted fusion of the output probabilities of models trained on different features, as well as fine-tuning a generated LSTM model with early-fused features. Experimental results on the Microsoft Video Description (MSVD) dataset indicate that fusing features of different kinds improves the evaluation score, whereas fusing features of the same kind scores no higher than the best single feature, and fine-tuning a pre-trained model with other features performs poorly. Among all the tested feature combinations, the method combining early fusion and late fusion achieves a METEOR score of 0.302, which is 1.34% higher than the best score found in the literature, indicating that the proposed method can improve the accuracy of automatic video description.
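The early- and late-fusion steps described in the abstract are straightforward to prototype. The following is a minimal NumPy sketch, not the authors' implementation: the function names, feature shapes, fusion weights and the two toy scoring models are illustrative assumptions, and the CNN feature extractors and the LSTM decoder used in the paper are omitted.

import numpy as np

def early_fusion_concat(spatial, motion, video):
    # Early fusion by concatenation: per-frame feature vectors are
    # stacked along the feature axis.
    return np.concatenate([spatial, motion, video], axis=-1)

def early_fusion_weighted_sum(features, weights):
    # Early fusion by aligned weighted sum: the features must first be
    # aligned to a common dimensionality (e.g., by linear projection).
    return sum(w * f for w, f in zip(weights, features))

def late_fusion_rescore(candidates, models, weights):
    # Late fusion at test time: one model proposes candidate outputs,
    # every model scores each candidate, and the candidate with the
    # highest weighted sum of probabilities is returned.
    best, best_score = None, float("-inf")
    for cand in candidates:
        score = sum(w * m(cand) for w, m in zip(weights, models))
        if score > best_score:
            best, best_score = cand, score
    return best

# Toy usage: 16 frames with 128-dimensional features per modality, and
# two dummy "models" standing in for decoders trained on different features.
rng = np.random.default_rng(0)
spatial, motion, video = (rng.standard_normal((16, 128)) for _ in range(3))
fused = early_fusion_concat(spatial, motion, video)              # (16, 384)
summed = early_fusion_weighted_sum([spatial, motion, video],
                                   [0.5, 0.3, 0.2])              # (16, 128)
caption = late_fusion_rescore(
    candidates=["a man is cooking", "a man is talking"],
    models=[lambda c: 0.6 if "cooking" in c else 0.4,
            lambda c: 0.7 if "cooking" in c else 0.3],
    weights=[0.5, 0.5])
print(fused.shape, summed.shape, caption)

In the actual method, each scoring model would be an LSTM description model trained on a different feature combination, and the candidate set would come from the decoding procedure of the selected model (e.g., beam search) rather than a fixed list.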

Key words: deep learning, feature fusion, video semantic analysis, video description, recurrent neural network, Long Short-Term Memory (LSTM)
