[1] KULKARNI G, PREMRAJ V, DHAR S, et al. BabyTalk: understanding and generating simple image descriptions[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(12): 2891-2903.
[2] MITCHELL M, DODGE J, GOYAL A, et al. Midge: generating image descriptions from computer vision detections[C]//Proceedings of the 2012 13th Conference of the European Chapter of the Association for Computational Linguistics. Stroudsburg: ACL, 2012: 747-756.
[3] ELLIOTT D, KELLER F. Image description using visual dependency representations[C]//Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2013: 1292-1302.
[4] FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: generating sentences from images[C]//Proceedings of the 2010 European Conference on Computer Vision, LNCS 6314. Berlin: Springer, 2010: 15-29.
[5] SOCHER R, KARPATHY A, LE Q V, et al. Grounded compositional semantics for finding and describing images with sentences[J]. Transactions of the Association for Computational Linguistics, 2014, 2: 207-218.
[6] KUZNETSOVA P, ORDONEZ V, BERG T L, et al. TreeTalk: composition and compression of trees for image descriptions[J]. Transactions of the Association for Computational Linguistics, 2014, 2: 351-362.
[7] KUZNETSOVA P, ORDONEZ V, BERG A, et al. Generalizing image captions for image-text parallel corpus[C]//Proceedings of the 2013 51st Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2013: 790-796.
[8] MASON R, CHARNIAK E. Nonparametric method for data-driven image captioning[C]//Proceedings of the 2014 52nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2014: 592-598.
[9] MAO J, XU W, YANG Y, et al. Deep captioning with multimodal Recurrent Neural Networks (m-RNN)[EB/OL]. [2020-11-17]. https://arxiv.org/pdf/1412.6632.pdf.
[10] KIROS R, SALAKHUTDINOV R, ZEMEL R S. Unifying visual-semantic embeddings with multimodal neural language models[EB/OL]. [2020-11-17]. https://arxiv.org/pdf/1411.2539.pdf.
[11] SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 1-9.
[12] JIA X, GAVVES E, FERNANDO B, et al. Guiding the long-short term memory model for image caption generation[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 2407-2415.
[13] XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C]//Proceedings of the 2015 32nd International Conference on Machine Learning. New York: JMLR.org, 2015: 2048-2057.
[14] LI L, TANG S, DENG L, et al. Image caption with global-local attention[C]//Proceedings of the 2017 31st AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2017: 4133-4139.
[15] LUO M, CHANG X, LI Z, et al. Simple to complex cross-modal learning to rank[J]. Computer Vision and Image Understanding, 2017, 163: 67-77.
[16] HE X, SHI B, BAI X, et al. Image caption generation with part of speech guidance[J]. Pattern Recognition Letters, 2019, 119: 229-237.
[17] YANG J, SUN Y, LIANG J, et al. Image captioning by incorporating affective concepts learned from both visual and textual components[J]. Neurocomputing, 2019, 328: 56-68.
[18] ZHAO D, CHANG Z, GUO S. A multimodal fusion approach for image captioning[J]. Neurocomputing, 2019, 329: 476-485.
[19] LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot MultiBox detector[C]//Proceedings of the 2016 European Conference on Computer Vision, LNCS 9905. Cham: Springer, 2016: 21-37.
[20] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
[21] WANG C, YANG H, MEINEL C. Image captioning with deep bidirectional LSTMs and multi-task learning[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2018, 14(2s): Article No. 40.
[22] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3156-3164.
[23] DONAHUE J, HENDRICKS L A, ROHRBACH M, et al. Long-term recurrent convolutional networks for visual recognition and description[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 677-691.
[24] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3128-3137.
[25] CHENG Y, HUANG F, ZHOU L, et al. A hierarchical multimodal attention-based neural network for image captioning[C]//Proceedings of the 2017 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2017: 889-892.
[26] QU S, XI Y, DING S. Visual attention based on long-short term memory model for image caption generation[C]//Proceedings of the 2017 29th Chinese Control and Decision Conference. Piscataway: IEEE, 2017: 4789-4794.
[27] 王媛华. 基于多融合模型的图像语义描述研究[J]. 河南科技, 2019(14): 34-36. (WANG Y H. Image caption based on multi-fusion model[J]. Henan Science and Technology, 2019(14): 34-36.)