Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (4): 1179-1184. DOI: 10.11772/j.issn.1001-9081.2017.04.1179

• Computer Vision and Virtual Reality •


Deep natural language description method for video based on multi-feature fusion

LIANG Rui1, ZHU Qingxin1, LIAO Shujiao1, NIU Xinzheng2   

  1. School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu Sichuan 610054, China;
    2. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu Sichuan 610054, China
  • Received: 2016-09-14  Revised: 2016-12-25  Online: 2017-04-10  Published: 2017-04-19
  • Corresponding author: ZHU Qingxin
  • About the authors: LIANG Rui (1985-), male, born in Suining, Sichuan, Ph.D. candidate, CCF member; his research interests include computer vision and video semantic analysis. ZHU Qingxin (1954-), male, born in Chengdu, Sichuan, Ph.D., professor, CCF member; his research interests include software engineering, graphics and vision, computational operations research and bioinformatics. LIAO Shujiao (1981-), female, born in Zhangzhou, Fujian, Ph.D. candidate, CCF member; her research interests include machine learning and granular computing. NIU Xinzheng (1978-), male, born in Chengdu, Sichuan, Ph.D., associate professor; his research interests include machine learning, big data and mobile computing.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61300192) and the Fundamental Research Funds for the Central Universities (ZYGX2014J052).


Abstract: Concerning the low accuracy of automatic video labelling and description by computers, a deep natural language description method for video based on multi-feature fusion was proposed. The spatial features, motion features and video features of a video frame sequence were extracted and fused to train a natural language description model based on Long Short-Term Memory (LSTM). Several natural language description models were trained on different combinations of early-fused features, and late fusion was then performed at test time: one model was selected to generate the possible outputs for the current input, the probabilities of these outputs were recomputed with the other models, the weighted sum of these probabilities was computed, and the output with the highest probability was taken as the result. The feature fusion methods investigated include early fusion, namely feature concatenation and weighted summing of different features after alignment, and late fusion, namely weighted fusion of the output probabilities of models trained on different features, as well as fine-tuning a generated LSTM model with early-fused features. Experimental results on the Microsoft Video Description (MSVD) dataset indicate that fusing features of different kinds improves the evaluation score, whereas fusing features of the same kind scores no higher than the best single feature, and fine-tuning a pre-trained model with other features performs poorly. Among all the tested feature combinations, the method combining early fusion and late fusion achieves a METEOR score of 0.302, which is 1.34% higher than the best score found in the literature, indicating that the proposed method can improve the accuracy of automatic video description.
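The early- and late-fusion steps described in the abstract are straightforward to prototype. The following is a minimal NumPy sketch, not the authors' implementation: the function names, feature shapes, fusion weights and the two toy scoring models are illustrative assumptions, and the CNN feature extractors and the LSTM decoder used in the paper are omitted.

import numpy as np

def early_fusion_concat(spatial, motion, video):
    # Early fusion by concatenation: per-frame feature vectors are
    # stacked along the feature axis.
    return np.concatenate([spatial, motion, video], axis=-1)

def early_fusion_weighted_sum(features, weights):
    # Early fusion by aligned weighted sum: the features must first be
    # aligned to a common dimensionality (e.g., by linear projection).
    return sum(w * f for w, f in zip(weights, features))

def late_fusion_rescore(candidates, models, weights):
    # Late fusion at test time: one model proposes candidate outputs,
    # every model scores each candidate, and the candidate with the
    # highest weighted sum of probabilities is returned.
    best, best_score = None, float("-inf")
    for cand in candidates:
        score = sum(w * m(cand) for w, m in zip(weights, models))
        if score > best_score:
            best, best_score = cand, score
    return best

# Toy usage: 16 frames with 128-dimensional features per modality, and
# two dummy "models" standing in for decoders trained on different features.
rng = np.random.default_rng(0)
spatial, motion, video = (rng.standard_normal((16, 128)) for _ in range(3))
fused = early_fusion_concat(spatial, motion, video)              # (16, 384)
summed = early_fusion_weighted_sum([spatial, motion, video],
                                   [0.5, 0.3, 0.2])              # (16, 128)
caption = late_fusion_rescore(
    candidates=["a man is cooking", "a man is talking"],
    models=[lambda c: 0.6 if "cooking" in c else 0.4,
            lambda c: 0.7 if "cooking" in c else 0.3],
    weights=[0.5, 0.5])
print(fused.shape, summed.shape, caption)

In the actual method, each scoring model would be an LSTM description model trained on a different feature combination, and the candidate set would come from the decoding procedure of the selected model (e.g., beam search) rather than a fixed list.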

Key words: deep learning, feature fusion, video semantic analysis, video description, recurrent neural network, Long Short-Term Memory (LSTM)
