《计算机应用》唯一官方网站 ›› 2022, Vol. 42 ›› Issue (3): 860-866.DOI: 10.11772/j.issn.1001-9081.2021030441
所属专题: 人工智能
收稿日期:
2021-03-23
修回日期:
2021-07-20
接受日期:
2021-07-21
发布日期:
2022-04-09
出版日期:
2022-03-10
通讯作者:
孙邱杰
作者简介:
梁景贵(1996—),男,广西玉林人,硕士研究生,主要研究方向:自然语言理解、语法纠错基金资助:
Qiujie SUN(), Jinggui LIANG, Si LI
Received:
2021-03-23
Revised:
2021-07-20
Accepted:
2021-07-21
Online:
2022-04-09
Published:
2022-03-10
Contact:
Qiujie SUN
About author:
LIANG Jinggui, born in 1996, M. S. candidate. His research interests include natural language understanding, grammatical error correction.Supported by:
摘要:
在中文语法纠错中,基于神经机器翻译的方法被广泛应用,该方法在训练过程中需要大量的标注数据才能保障性能,但中文语法纠错的标注数据较难获取。针对标注数据有限导致中文语法纠错系统性能不佳问题,提出一种基于BART噪声器的中文语法纠错模型——BN-CGECM。首先,为了加快模型的收敛,使用基于BERT的中文预训练语言模型对BN-CGECM的编码器参数进行初始化;其次,在训练过程中,通过BART噪声器对输入样本引入文本噪声,自动生成更多样的含噪文本用于模型训练,从而缓解标注数据有限的问题。在NLPCC 2018数据集上的实验结果表明,所提模型的F0.5值比有道开发的中文语法纠错系统(YouDao)提高7.14个百分点,比北京语言大学开发的集成中文语法纠错系统(BLCU_ensemble)提高6.48个百分点;同时,所提模型不增加额外的训练数据量,增强了原始数据的多样性,且具有更快的收敛速度。
中图分类号:
孙邱杰, 梁景贵, 李思. 基于BART噪声器的中文语法纠错模型[J]. 计算机应用, 2022, 42(3): 860-866.
Qiujie SUN, Jinggui LIANG, Si LI. Chinese grammatical error correction model based on bidirectional and auto-regressive transformers noiser[J]. Journal of Computer Applications, 2022, 42(3): 860-866.
错误 类型 | 错误句子(输入序列) | 正确句子(输出序列) |
---|---|---|
M | 中国是世界拥有最多“烟民”的国家。 | 中国是世界上拥有最多“烟民”的国家。 |
R | 孩子的教育不能只靠一个学校老师。 | 孩子的教育不能只靠一个老师。 |
S | 父母对孩子的爱情是最重要的。 | 父母对孩子的关爱是最重要的。 |
W | 生产率较低,那肯定价格要上升。 | 生产率较低,那价格肯定要上升。 |
表1 语法纠错任务示例
Tab. 1 Examples for grammatical error correction
错误 类型 | 错误句子(输入序列) | 正确句子(输出序列) |
---|---|---|
M | 中国是世界拥有最多“烟民”的国家。 | 中国是世界上拥有最多“烟民”的国家。 |
R | 孩子的教育不能只靠一个学校老师。 | 孩子的教育不能只靠一个老师。 |
S | 父母对孩子的爱情是最重要的。 | 父母对孩子的关爱是最重要的。 |
W | 生产率较低,那肯定价格要上升。 | 生产率较低,那价格肯定要上升。 |
类型 | 句子数 | Src词数 | Tgt词数 |
---|---|---|---|
原始训练集 | 1 200 000 | 23700 000 | 25 000 000 |
伪训练集 | 1 200 000 | 23700 000 | 25 100 000 |
验证集 | 5 000 | 99 300 | 104 100 |
测试集 | 2 000 | 58 900 | — |
表2 NLPCC 2018 Task 2 数据集
Tab. 2 NLPCC 2018 Task 2 dataset
类型 | 句子数 | Src词数 | Tgt词数 |
---|---|---|---|
原始训练集 | 1 200 000 | 23700 000 | 25 000 000 |
伪训练集 | 1 200 000 | 23700 000 | 25 100 000 |
验证集 | 5 000 | 99 300 | 104 100 |
测试集 | 2 000 | 58 900 | — |
模型 | P | R | |
---|---|---|---|
YouDao | 35.24 | 18.64 | 29.91 |
BLCU | 41.73 | 13.08 | 29.02 |
BLCU_ensemble | 47.63 | 12.56 | 30.57 |
BERT-encoder | 32.67 | 22.19 | 29.76 |
BERT-encoder_ensemble | 41.84 | 22.02 | 35.51 |
BN-CGECM | 44.27 | 18.36 | 34.53 |
BN-CGECM_ensemble | 51.57 | 17.43 | 37.05 |
表3 几种模型在NLPCC 2018数据集的实验结果 (%)
Tab. 3 Experimental results of several models on NLPCC 2018 dataset
模型 | P | R | |
---|---|---|---|
YouDao | 35.24 | 18.64 | 29.91 |
BLCU | 41.73 | 13.08 | 29.02 |
BLCU_ensemble | 47.63 | 12.56 | 30.57 |
BERT-encoder | 32.67 | 22.19 | 29.76 |
BERT-encoder_ensemble | 41.84 | 22.02 | 35.51 |
BN-CGECM | 44.27 | 18.36 | 34.53 |
BN-CGECM_ensemble | 51.57 | 17.43 | 37.05 |
方法 | P | R | |
---|---|---|---|
Char-Transformer | 39.95 | 12.71 | 27.96 |
Char-Transformer+字屏蔽 | 45.25 | 17.40 | 34.28 |
Char-Transformer+随机字替换 | 21.38 | 24.15 | 21.88 |
Char-Transformer+文本填充 | 46.16 | 16.25 | 33.74 |
Char-Transformer+混合方法 | 44.27 | 18.36 | 34.53 |
表4 不同噪声方法的实验结果 (%)
Tab. 4 Experimental results of different noise methods
方法 | P | R | |
---|---|---|---|
Char-Transformer | 39.95 | 12.71 | 27.96 |
Char-Transformer+字屏蔽 | 45.25 | 17.40 | 34.28 |
Char-Transformer+随机字替换 | 21.38 | 24.15 | 21.88 |
Char-Transformer+文本填充 | 46.16 | 16.25 | 33.74 |
Char-Transformer+混合方法 | 44.27 | 18.36 | 34.53 |
预训练模型 | P | R | |
---|---|---|---|
— | 44.27 | 18.36 | 34.53 |
Chinese-BERT-wwm | 44.46 | 18.38 | 34.63 |
Chinese-BERT-wwm-ext | 44.38 | 18.37 | 34.59 |
Chinese-RoBERTa-wwm-ext | 45.55 | 18.50 | 35.24 |
表5 不同预训练模型的实验结果 (%)
Tab. 5 Experimental results of different pre-trained models
预训练模型 | P | R | |
---|---|---|---|
— | 44.27 | 18.36 | 34.53 |
Chinese-BERT-wwm | 44.46 | 18.38 | 34.63 |
Chinese-BERT-wwm-ext | 44.38 | 18.37 | 34.59 |
Chinese-RoBERTa-wwm-ext | 45.55 | 18.50 | 35.24 |
1 | MARTINS B, SILVA M J. Spelling correction for search engine queries [C]// Proceedings of the 2004 International Conference on Natural Language Processing. Cham: Springer, 2004: 372-383. 10.1007/978-3-540-30228-5_33 |
2 | GAO J F, LI X L, MICOL D, et al. A large scale ranker-based system for search query spelling correction [C]// Proceedings of the 23rd International Conference on Computational Linguistics. New York: ACM, 2010: 358-366. |
3 | AFLI H, QIU Z, WAY A, et al. Using SMT for OCR error correction of historical texts [C]// Proceedings of the Tenth International Conference on Language Resources and Evaluation. Portorož: European Language Resources Association, 2016: 962-966. |
4 | WANG D M, SONG Y, LI J, et al. A hybrid approach to automatic corpus generation for Chinese spelling check [C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels: Association for Computational Linguistics, 2018: 2517-2527. 10.18653/v1/d18-1273 |
5 | BURSTEIN J, CHODOROW M. Automated essay scoring for nonnative English speakers[C]// Proceedings of a Symposium on Computer Mediated Language Assessment and Evaluation in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 1999: 68-75. 10.3115/1598834.1598847 |
6 | YUAN Z, BRISCOE T. Grammatical error correction using neural machine translation [C]// Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: Association for Computational Linguistics, 2016: 380-386. 10.18653/v1/n16-1042 |
7 | JI J S, WANG Q L, TOUTANOVA K, et al. A nested attention neural hybrid model for grammatical error correction [EB/OL]. [2020-10-10]. . 10.18653/v1/p17-1070 |
8 | CHOLLAMPATT S, NG H T. A multilayer convolutional encoder-decoder neural network for grammatical error correction [EB/OL]. [2020-10-10]. . 10.18653/v1/d18-1274 |
9 | CHENG X Y, XU W D, CHEN K L, et al. SpellGCN: incorporating phonological and visual similarities into language models for Chinese spelling check[EB/OL]. [2021-01-10]. . 10.18653/v1/2020.acl-main.81 |
10 | REN H K, YANG L, XUN E. A sequence to sequence learning for Chinese grammatical error correction [C]// Proceedings of the 7th CCF International Conference on Natural Language Processing and Chinese Computing. Cham: Springer, 2018: 401-410. 10.1007/978-3-319-99501-4_36 |
11 | ZHOU J, LI C, LIU H, et al. Chinese grammatical error correction using statistical and neural models [C]// Proceedings of the 7th CCF International Conference on Natural Language Processing and Chinese Computing. Cham: Springer, 2018: 117-128. 10.1007/978-3-319-99501-4_10 |
12 | 张佳宁,严冬梅,王勇. 基于word2vec的语音识别后文本纠错[J]. 计算机工程与设计, 2020,41(11):3235-3240. 10.16208/j.issn1000-7024.2020.11.038 |
ZHANG J N, YAN D M, WANG Y. Text correction based on word2vec speech recognition full-text in Chinese [J]. Computer Engineering and Design, 2020,41(11):3235-3240. 10.16208/j.issn1000-7024.2020.11.038 | |
13 | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: Association for Computational Linguistics, 2019: 4171-4186. 10.18653/v1/n19-1423 |
14 | CUI Y M, CHE W X, LIU T, et al. Revisiting pre-trained models for Chinese natural language processing [EB/OL].[2020-10-10]. . 10.18653/v1/2020.findings-emnlp.58 |
15 | ZHANG Z, HAN X, LIU Z, et al. ERNIE: enhanced language representation with informative entities [C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2019: 1441-1451. 10.18653/v1/p19-1139 |
16 | KIYONO S, SUZUKI J, MITA M, et al. An empirical study of incorporating pseudo data into grammatical error correction [C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2019: 1236-1242. 10.18653/v1/d19-1119 |
17 | LEWIS M, LIU Y, GOYAL N, et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension [C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2020: 7871-7880. 10.18653/v1/2020.acl-main.703 |
18 | REN H, YANG L, XUN E. A sequence to sequence learning for Chinese grammatical error correction [C]// Proceedings of the 2018 CCF International Conference on Natural Language Processing and Chinese Computing. Cham: Springer, 2018: 401-410. 10.1007/978-3-319-99501-4_36 |
19 | BUSTAMANTE F R, LEÓN F S. GramCheck: a grammar and style checker [C]// Proceedings of the 16th Conference on Computational Linguistics. New York: ACM, 1996,1: 175-181. 10.3115/992628.992661 |
20 | HEIDORN G E, JENSEN K, MILLER L A, et al. The EPISTLE text-critiquing system[J]. IBM Systems Journal, 1982, 21(3): 305-326. 10.1147/sj.213.0305 |
21 | DE FELICE R, PULMAN S. A classifier-based approach to preposition and determiner error correction in L2 English [C]// Proceedings of the 22nd International Conference on Computational Linguistics. New York: ACM, 2008: 169-176. 10.3115/1599081.1599103 |
22 | ROZOVSKAYA A, ROTH D. Grammatical error correction: machine translation and classifiers [C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2016: 2205-2215. 10.18653/v1/p16-1208 |
23 | BROCKETT C, DOLAN W B, GAMON M. Correcting ESL errors using phrasal SMT techniques [C]// Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2006: 249-256. 10.3115/1220175.1220207 |
24 | JUNCZYS-DOWMUNT M, GRUNDKIEWICZ R. Phrase-based machine translation is state-of-the-art for automatic grammatical error correction [C]// Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2016: 1546-1556. 10.18653/v1/d16-1161 |
25 | NG H T, WU S M, BRISCOE T, et al. The CoNLL-2014 shared task on grammatical error correction [C]// Proceedings of the 18th Conference on Computational Natural Language Learning: Shared Task. Stroudsburg, PA: Association for Computational Linguistics, 2014: 1-14. 10.3115/v1/w14-1701 |
26 | XIE Z, AVATI A, ARIVAZHAGAN N, et al. Neural language correction with character-based attention [EB/OL]. [2016-05-31]. . |
27 | CHOLLAMPATT S, NG H T. A multilayer convolutional encoder-decoder neural network for grammatical error correction [EB/OL]. [2018-01-26]. . 10.18653/v1/d18-1274 |
28 | GRUNDKIEWICZ R, JUNCZYS-DOWMUNT M. Near human-level performance in grammatical error correction with hybrid machine translation[EB/OL]. [2018-04-16]. . 10.18653/v1/n18-2046 |
29 | WANG H, KUROSAWA M, KATSUMATA S, et al. Chinese grammatical correction using BERT-based pre-trained model [EB/OL]. [2020-11-04]. . |
30 | FELICE M, YUAN Z. Generating artificial errors for grammatical error correction [C]// Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2014: 116-126. 10.3115/v1/e14-3013 |
31 | XIE Z, AVATI A, ARIVAZHAGAN N, et al. Neural language correction with character-based attention [EB/OL]. [2016-05-31]. . |
32 | ZHAO Y, JIANG N, SUN W, et al. Overview of the NLPCC 2018 shared task: grammatical error correction [C]// Proceedings of the 2018 CCF International Conference on Natural Language Processing and Chinese Computing. Cham: Springer, 2018: 439-445. 10.1007/978-3-319-99501-4_41 |
33 | FU K, HUANG J, DUAN Y. Youdao’s winning solution to the NLPCC-2018 task 2 challenge: a neural machine translation approach to Chinese grammatical error correction [C]// Proceedings of the 2018 CCF International Conference on Natural Language Processing and Chinese Computing. Cham: Springer, 2018: 341-350. 10.1007/978-3-319-99495-6_29 |
34 | DAHLMEIER D, NG H T. Better evaluation for grammatical error correction [C]// Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: Association for Computational Linguistics, 2012: 568-572. |
[1] | 秦璟, 秦志光, 李发礼, 彭悦恒. 基于概率稀疏自注意力神经网络的重性抑郁疾患诊断[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2970-2974. |
[2] | 王熙源, 张战成, 徐少康, 张宝成, 罗晓清, 胡伏原. 面向手术导航3D/2D配准的无监督跨域迁移网络[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2911-2918. |
[3] | 黄云川, 江永全, 黄骏涛, 杨燕. 基于元图同构网络的分子毒性预测[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2964-2969. |
[4] | 潘烨新, 杨哲. 基于多级特征双向融合的小目标检测优化模型[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2871-2877. |
[5] | 李顺勇, 李师毅, 胥瑞, 赵兴旺. 基于自注意力融合的不完整多视图聚类算法[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2696-2703. |
[6] | 杨莹, 郝晓燕, 于丹, 马垚, 陈永乐. 面向图神经网络模型提取攻击的图数据生成方法[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2483-2492. |
[7] | 刘禹含, 吉根林, 张红苹. 基于骨架图与混合注意力的视频行人异常检测方法[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2551-2557. |
[8] | 顾焰杰, 张英俊, 刘晓倩, 周围, 孙威. 基于时空多图融合的交通流量预测[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2618-2625. |
[9] | 石乾宏, 杨燕, 江永全, 欧阳小草, 范武波, 陈强, 姜涛, 李媛. 面向空气质量预测的多粒度突变拟合网络[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2643-2650. |
[10] | 赵亦群, 张志禹, 董雪. 基于密集残差物理信息神经网络的各向异性旅行时计算方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2310-2318. |
[11] | 徐松, 张文博, 王一帆. 基于时空信息的轻量视频显著性目标检测网络[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2192-2199. |
[12] | 孙逊, 冯睿锋, 陈彦如. 基于深度与实例分割融合的单目3D目标检测方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2208-2215. |
[13] | 吴筝, 程志友, 汪真天, 汪传建, 王胜, 许辉. 基于深度学习的患者麻醉复苏过程中的头部运动幅度分类方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2258-2263. |
[14] | 李欢欢, 黄添强, 丁雪梅, 罗海峰, 黄丽清. 基于多尺度时空图卷积网络的交通出行需求预测[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2065-2072. |
[15] | 张郅, 李欣, 叶乃夫, 胡凯茜. 基于暗知识保护的模型窃取防御技术DKP[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2080-2086. |
阅读次数 | ||||||
全文 |
|
|||||
摘要 |
|
|||||