[1] FANG H, GUPTA S, IANDOLA F, et al. From captions to visual concepts and back[C]//Proceedings of the 2015 International Conference on Computer Vision and Pattern Recognition. Washington, DC:IEEE Computer Society, 2015:1473-1482.
[2] LECUN Y, BENGIO Y, HINTON G. Deep learning[J]. Nature, 2015, 521(7553):436-444.
[3] HOPFIELD J J. Neural networks and physical systems with emergent collective computational abilities[J]. Proceedings of the National Academy of Sciences of the United States of America, 1982, 79(8):2554-2558.
[4] MAO J, XU W, YANG Y, et al. Explain images with multimodal recurrent neural networks[EB/OL].[2018-06-10]. https://arxiv.org/pdf/1410.1090v1.pdf.
[5] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 2012 International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada:Curran Associates Inc., 2012:1097-1105.
[6] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell:a neural image caption generator[C]//Proceedings of the 2015 International Conference on Computer Vision and Pattern Recognition. Washington, DC:IEEE Computer Society, 2015:3156-3164.
[7] LIANG R, ZHU Q X, LIAO S J, et al. Deep natural language description method for video based on multi-feature fusion[J]. Journal of Computer Applications, 2017, 37(4):1179-1184.
[8] XU K, BA J L, KIROS R, et al. Show, attend and tell:neural image caption generation with visual attention[EB/OL].[2018-06-08]. https://arxiv.org/pdf/1502.03044.pdf.
[9] BAHDANAU D, CHO K H, BENGIO Y. Neural machine translation by jointly learning to align and translate[EB/OL].[2018-06-10]. https://arxiv.org/pdf/1409.0473.pdf.
[10] LU J, XIONG C, PARIKH D, et al. Knowing when to look:adaptive attention via a visual sentinel for image captioning[C]//Proceedings of the 2017 International Conference on Computer Vision and Pattern Recognition. Washington, DC:IEEE Computer Society, 2017:3242-3250.
[11] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[EB/OL].[2018-05-10]. https://arxiv.org/pdf/1706.03762.pdf.
[12] LI J, MEI X, PROKHOROV D, et al. Deep neural network for structural prediction and lane detection in traffic scene[J]. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28(3):690-703.
[13] QU Y, LIN L, SHEN F, et al. Joint hierarchical category structure learning and large-scale image classification[J]. IEEE Transactions on Image Processing, 2017, 26(9):4331-4346.
[14] SHELHAMER E, LONG J, DARRELL T. Fully convolutional networks for semantic segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4):640-651.
[15] GONG C, TAO D, LIU W, et al. Label propagation via teaching-to-learn and learning-to-teach[J]. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28(6):1452-1465.
[16] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 International Conference on Computer Vision and Pattern Recognition. Washington, DC:IEEE Computer Society, 2016:770-778.
[17] WANG P, LIU L, SHEN C, et al. Multi-attention network for one shot learning[C]//Proceedings of the 2017 International Conference on Computer Vision and Pattern Recognition. Washington, DC:IEEE Computer Society, 2017:22-25.
[18] REN S, HE K, GIRSHICK R, et al. Faster R-CNN:towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6):1137-1149.
[19] LIN T-Y, MAIRE M, BELONGIE S, et al. Microsoft COCO:common objects in context[C]//Proceedings of the 2014 European Conference on Computer Vision. Cham:Springer, 2014:740-755.
[20] KARPATHY A, LI F-F. Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the 2015 International Conference on Computer Vision and Pattern Recognition. Washington, DC:IEEE Computer Society, 2015:3128-3137.
[21] PAPINENI K, ROUKOS S, WARD T, et al. BLEU:a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Stroudsburg, PA:ACL, 2002:311-318.
[22] LIN C-Y. ROUGE:a package for automatic evaluation of summaries[C]//Proceedings of the ACL 2004 Workshop on Text Summarization. Stroudsburg, PA:ACL, 2004:74-81.
[23] BANERJEE S, LAVIE A. METEOR:an automatic metric for MT evaluation with improved correlation with human judgments[C]//Proceedings of the 2005 ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg, PA:ACL, 2005:65-72.
[24] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr:consensus-based image description evaluation[C]//Proceedings of the 2015 International Conference on Computer Vision and Pattern Recognition. Washington, DC:IEEE Computer Society, 2015:4566-4575.
[25] ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and VQA[EB/OL].[2018-05-07]. https://arxiv.org/pdf/1707.07998.pdf.
[26] KINGMA D P, BA J. Adam:a method for stochastic optimization[EB/OL].[2018-04-22]. https://arxiv.org/pdf/1412.6980.pdf.
[27] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]//Proceedings of the 2017 International Conference on Computer Vision and Pattern Recognition. Washington, DC:IEEE Computer Society, 2017:1179-1195.
[28] YOU Q, JIN H, WANG Z, et al. Image captioning with semantic attention[C]//Proceedings of the 2016 International Conference on Computer Vision and Pattern Recognition. Washington, DC:IEEE Computer Society, 2016:4651-4659.
[29] YANG Z, YUAN Y, WU Y, et al. Encode, review, and decode:reviewer module for caption generation[EB/OL].[2018-06-10]. https://arxiv.org/pdf/1605.07912v1.pdf.
[30] YAO T, PAN Y, LI Y, et al. Boosting image captioning with attributes[EB/OL].[2018-03-10]. https://arxiv.org/pdf/1611.01646.pdf.