Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (3): 860-866. DOI: 10.11772/j.issn.1001-9081.2021030441

• Artificial Intelligence •


Chinese grammatical error correction model based on Bidirectional and Auto-Regressive Transformers (BART) noiser

Qiujie SUN, Jinggui LIANG, Si LI

  1. School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2021-03-23 Revised: 2021-07-20 Accepted: 2021-07-21 Online: 2022-04-09 Published: 2022-03-10
  • Contact: Qiujie SUN
  • About the authors: LIANG Jinggui, born in 1996, M. S. candidate. His research interests include natural language understanding and grammatical error correction.
    LI Si, born in 1985, Ph. D., associate professor. Her research interests include Chinese natural language understanding and computer vision.
  • Supported by:
    National Natural Science Foundation of China(61702047)


Abstract:

Methods based on neural machine translation are widely used in Chinese grammatical error correction. These methods require a large amount of annotated data to guarantee performance, and such data is difficult to obtain for Chinese grammatical error correction. To address the problem that the limited size of annotated data constrains the performance of Chinese grammatical error correction systems, a Chinese Grammatical Error Correction Model based on a Bidirectional and Auto-Regressive Transformers (BART) Noiser (BN-CGECM) was proposed. Firstly, to speed up model convergence, a Chinese pretrained language model based on BERT (Bidirectional Encoder Representations from Transformers) was used to initialize the parameters of BN-CGECM's encoder. Secondly, during training, a BART noiser was used to introduce text noise into the input samples, automatically generating more diverse noisy text for model training and thereby alleviating the problem of limited annotated data. Experimental results on the NLPCC 2018 dataset demonstrate that the proposed model achieves an F0.5 score 7.14 percentage points higher than that of the Chinese grammatical error correction system developed by YouDao, and 6.48 percentage points higher than that of the ensemble Chinese grammatical error correction system developed by Beijing Language and Culture University (BLCU_ensemble). Meanwhile, the proposed model enhances the diversity of the original data and converges faster without increasing the amount of training data.
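To make the augmentation step concrete, below is a minimal Python sketch (not the authors' implementation) of how a BART-style noiser can perturb the source side of a training pair while leaving the corrected target unchanged. The specific noise operations (token masking, deletion, local shuffling), their rates, and character-level tokenization are assumptions here, since the abstract only states that text noise is introduced into the input samples.

```python
import random

# Hypothetical BART-style noiser sketch; the exact operations and
# rates used by BN-CGECM are not specified in the abstract.
MASK = "[MASK]"

def bart_noise(tokens, mask_p=0.15, delete_p=0.05, shuffle_window=3):
    """Return a noised copy of a token sequence."""
    noised = []
    for tok in tokens:
        r = random.random()
        if r < delete_p:            # token deletion
            continue
        if r < delete_p + mask_p:   # token masking
            noised.append(MASK)
        else:
            noised.append(tok)
    # light local shuffling within small windows
    for i in range(0, len(noised) - shuffle_window, shuffle_window):
        window = noised[i:i + shuffle_window]
        random.shuffle(window)
        noised[i:i + shuffle_window] = window
    return noised

# Usage: noise the (already erroneous) source sentence each epoch so the
# Seq2Seq model sees more diverse inputs without extra annotated data.
src = list("他明天要去北京开会")  # character-level tokens, common for Chinese
print(bart_noise(src))
```

Because the noise is resampled every time a sentence is read, each epoch effectively presents a different corrupted variant of the same training pair, which is how diversity increases without enlarging the annotated corpus.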

Key words: data augmentation, Chinese grammatical error correction, text noise, deep learning, Sequence to Sequence (Seq2Seq) model, Bidirectional and Auto-Regressive Transformers (BART) noiser

CLC number: