Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (5): 1236-1246. DOI: 10.11772/j.issn.1001-9081.2020081152
Corresponding author: LIU Ruiheng
About the authors: LIU Ruiheng, born in 1997 in Lantian, Shaanxi, M. S. candidate. His research interests include natural language processing and data analysis. YE Xia, born in 1977 in Luhe, Jiangsu, Ph. D., associate professor. Her research interests include databases and computer networks. YUE Zengying, born in 1991 in Jining, Shandong, M. S. candidate. His research interests include natural language processing and data mining.
LIU Ruiheng, YE Xia, YUE Zengying
Received: 2020-08-03
Revised: 2020-11-15
Online: 2020-12-09
Published: 2021-05-10
Abstract: In recent years, deep learning technology has developed rapidly. In Natural Language Processing (NLP) tasks, as text representation techniques have advanced from the word level to the document level, unsupervised pre-training on large-scale corpora has been shown to effectively improve model performance on downstream tasks. Firstly, according to the development of text feature extraction techniques, typical models were analyzed at the word level and the document level. Secondly, the current research status of pre-trained models was analyzed from the two stages of pre-training objective tasks and downstream applications, and the characteristics of representative models were sorted out and summarized. Finally, the main challenges facing the development of pre-trained models were summarized, and prospects for the future were proposed.
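The pre-train/fine-tune paradigm summarized above can be made concrete with a minimal sketch (not taken from the paper): an encoder pre-trained on a large unlabeled corpus is loaded and then fine-tuned on a small labeled downstream classification task. The Hugging Face transformers and PyTorch APIs, the bert-base-chinese checkpoint, the toy data, and the hyper-parameters are all assumptions made purely for illustration.

```python
# Illustrative sketch of the pre-train / fine-tune paradigm: a BERT encoder
# pre-trained on a large unlabeled corpus is loaded and fine-tuned on a tiny
# labeled downstream task (binary sentence classification). The checkpoint
# name, toy data and hyper-parameters are assumptions, not from the paper.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)  # adds a randomly initialized task head

# Hypothetical labeled downstream examples.
texts = ["这部电影非常精彩", "服务太差了，不会再来"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # a few epochs usually suffice for fine-tuning
    outputs = model(**batch, labels=labels)  # cross-entropy loss on the task head
    outputs.loss.backward()                  # gradients flow into all encoder layers
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    preds = model(**batch).logits.argmax(dim=-1)  # downstream predictions
```

In this sketch only the small task-specific head is newly initialized, while all pre-trained encoder weights are updated during fine-tuning; feature-based and adapter-based adaptation strategies, which the survey also covers, would instead freeze most of the pre-trained parameters.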
LIU Ruiheng, YE Xia, YUE Zengying. Review of pre-trained models for natural language processing tasks[J]. Journal of Computer Applications, 2021, 41(5): 1236-1246.