[1] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL].[2016-09-14]. https://arxiv.org/pdf/1409.1556v6.pdf.
[2] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[EB/OL].[2016-09-14]. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[3] SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 1-9.
[4] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[EB/OL].[2016-09-14]. https://www.researchgate.net/publication/286512696_Deep_Residual_Learning_for_Image_Recognition.
[5] JIA Y, SHELHAMER E, DONAHUE J, et al. Caffe: convolutional architecture for fast feature embedding[EB/OL].[2016-03-10]. https://arxiv.org/pdf/1408.5093v1.pdf.
[6] CHEN D L, DOLAN W B. Collecting highly parallel data for paraphrase evaluation[C]//HLT 2011: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: Association for Computational Linguistics, 2011, 1: 190-200.
[7] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 3156-3164.
[8] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//NIPS 2014: Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2014: 3104-3112.
[9] CHO K, MERRIENBOER B V, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[EB/OL].[2016-09-10]. https://arxiv.org/pdf/1406.1078v3.pdf.
[10] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 3128-3137.
[11] KRISHNAMOORTHY N, MALKARNENKAR G, MOONEY R J, et al. Generating natural-language video descriptions using text-mined knowledge[C]//AAAI 2013: Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press, 2013: 541-547.
[12] THOMASON J, VENUGOPALAN S, GUADARRAMA S, et al. Integrating language and vision to generate natural language descriptions of videos in the wild[EB/OL].[2016-03-10]. http://www.cs.utexas.edu/users/ml/papers/thomason.coling14.pdf.
[13] VENUGOPALAN S, ROHRBACH M, DONAHUE J, et al. Sequence to sequence - video to text[EB/OL].[2016-03-10]. https://arxiv.org/pdf/1505.00487v3.pdf.
[14] VENUGOPALAN S, XU H, DONAHUE J, et al. Translating videos to natural language using deep recurrent neural networks[EB/OL].[2016-03-10]. https://arxiv.org/pdf/1412.4729v3.pdf.
[15] SHETTY R, LAAKSONEN J. Video captioning with recurrent networks based on frame- and video-level features and visual content classification[EB/OL].[2016-03-10]. https://arxiv.org/pdf/1512.02949v1.pdf.
[16] YAO L, TORABI A, CHO K, et al. Describing videos by exploiting temporal structure[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway, NJ: IEEE, 2015: 4507-4515.
[17] LI Y D, HAO Z B, LEI H. Survey of convolutional neural network[J]. Journal of Computer Applications, 2016, 36(9): 2508-2515. (in Chinese)
[18] FARNEBACK G. Two-frame motion estimation based on polynomial expansion[C]//SCIA 2003: Proceedings of the 13th Scandinavian Conference on Image Analysis, LNCS 2749. Berlin: Springer, 2003: 363-370.
[19] GKIOXARI G, MALIK J. Finding action tubes[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 759-768.
[20] WANG H, KLASER A, SCHMID C, et al. Action recognition by dense trajectories[C]//CVPR 2011: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. Washington, DC: IEEE Computer Society, 2011: 3169-3176.
[21] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[EB/OL].[2016-03-10]. https://arxiv.org/pdf/1411.5726v2.pdf.
[22] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2002: 311-318.
[23] LIN C Y. ROUGE: a package for automatic evaluation of summaries[EB/OL].[2016-03-10]. http://anthology.aclweb.org/W/W04/W04-1013.pdf.
[24] DENKOWSKI M, LAVIE A. Meteor universal: language specific translation evaluation for any target language[EB/OL].[2016-03-10]. https://www.cs.cmu.edu/~alavie/METEOR/pdf/meteor-1.5.pdf.
[25] CHEN X, FANG H, LIN T, et al. Microsoft COCO captions: data collection and evaluation server[EB/OL].[2016-09-14]. https://arxiv.org/pdf/1504.00325v2.pdf.