[1] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL].[2016-09-14]. https://arxiv.org/pdf/1409.1556v6.pdf.
[2] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[EB/OL].[2016-09-14]. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[3] SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 1-9.
[4] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[EB/OL].[2016-09-14]. https://www.researchgate.net/publication/286512696_Deep_Residual_Learning_for_Image_Recognition.
[5] JIA Y, SHELHAMER E, DONAHUE J, et al. Caffe: convolutional architecture for fast feature embedding[EB/OL].[2016-03-10]. https://arxiv.org/pdf/1408.5093v1.pdf.
[6] CHEN D L, DOLAN W B. Collecting highly parallel data for paraphrase evaluation[C]//HLT 2011: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: Association for Computational Linguistics, 2011, 1: 190-200.
[7] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 3156-3164.
[8] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//NIPS 2014: Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2014: 3104-3112.
[9] CHO K, MERRIENBOER B V, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[EB/OL].[2016-09-10]. https://arxiv.org/pdf/1406.1078v3.pdf.
[10] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 3128-3137.
[11] KRISHNAMOORTHY N, MALKARNENKAR G, MOONEY R J, et al. Generating natural-language video descriptions using text-mined knowledge[C]//AAAI 2013: Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press, 2013: 541-547.
[12] THOMASON J, VENUGOPALAN S, GUADARRAMA S, et al. Integrating language and vision to generate natural language descriptions of videos in the wild[EB/OL].[2016-03-10]. http://www.cs.utexas.edu/users/ml/papers/thomason.coling14.pdf.
[13] VENUGOPALAN S, ROHRBACH M, DONAHUE J, et al. Sequence to sequence - video to text[EB/OL].[2016-03-10]. https://arxiv.org/pdf/1505.00487v3.pdf.
[14] VENUGOPALAN S, XU H, DONAHUE J, et al. Translating videos to natural language using deep recurrent neural networks[EB/OL].[2016-03-10]. https://arxiv.org/pdf/1412.4729v3.pdf.
[15] SHETTY R, LAAKSONEN J. Video captioning with recurrent networks based on frame- and video-level features and visual content classification[EB/OL].[2016-03-10]. https://arxiv.org/pdf/1512.02949v1.pdf.
[16] YAO L, TORABI A, CHO K, et al. Describing videos by exploiting temporal structure[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway, NJ: IEEE, 2015: 4507-4515.
[17] LI Y D, HAO Z B, LEI H. Survey of convolutional neural network[J]. Journal of Computer Applications, 2016, 36(9): 2508-2515. (in Chinese)
[18] FARNEBACK G. Two-frame motion estimation based on polynomial expansion[C]//SCIA 2003: Proceedings of the 13th Scandinavian Conference on Image Analysis, LNCS 2749. Berlin: Springer, 2003: 363-370.
[19] GKIOXARI G, MALIK J. Finding action tubes[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 759-768.
[20] WANG H, KLASER A, SCHMID C, et al. Action recognition by dense trajectories[C]//CVPR 2011: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. Washington, DC: IEEE Computer Society, 2011: 3169-3176.
[21] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[EB/OL].[2016-03-10]. https://arxiv.org/pdf/1411.5726v2.pdf.
[22] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2002: 311-318.
[23] LIN C Y. ROUGE: a package for automatic evaluation of summaries[EB/OL].[2016-03-10]. http://anthology.aclweb.org/W/W04/W04-1013.pdf.
[24] DENKOWSKI M, LAVIE A. Meteor universal: language specific translation evaluation for any target language[EB/OL].[2016-03-10]. https://www.cs.cmu.edu/~alavie/METEOR/pdf/meteor-1.5.pdf.
[25] CHEN X, FANG H, LIN T, et al. Microsoft COCO captions: data collection and evaluation server[EB/OL].[2016-09-14]. https://arxiv.org/pdf/1504.00325v2.pdf.