Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (3): 849-855. DOI: 10.11772/j.issn.1001-9081.2024091325

• Frontier Research and Typical Applications of Large Models •

Chinese spelling correction method based on LLM with multiple inputs

Can MA1,2, Ruizhang HUANG1,2, Lina REN1,2,3, Ruina BAI1,2, Yaoyao WU1,2

  1. Engineering Research Center of Ministry of Education for Text Computing and Cognitive Intelligence (Guizhou University), Guiyang, Guizhou 550025, China
    2. College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou 550025, China
    3. Department of Information Engineering, Guizhou Light Industry Technical College, Guiyang, Guizhou 550025, China
  • Received: 2024-09-20; Revised: 2024-12-11; Accepted: 2024-12-13; Online: 2025-02-13; Published: 2025-03-10
  • Corresponding author: Ruizhang HUANG
  • About the authors: MA Can, born in 1992 in Ezhou, Hubei, M.S. His research interests include text correction and text mining.
    REN Lina, born in 1987 in Fuxin, Liaoning, Ph.D., lecturer. Her research interests include text mining and machine learning.
    BAI Ruina, born in 1994 in Urumqi, Xinjiang, Ph.D., associate professor. Her research interests include text mining and machine learning.
    WU Yaoyao, born in 2000 in Anshun, Guizhou, M.S. candidate. Her research interests include text mining and machine learning.
  • Supported by:
    National Natural Science Foundation of China (62066007); Guizhou Province Science and Technology Support Program (2022277)


Abstract:

Chinese Spelling Correction (CSC) is an important research task in Natural Language Processing (NLP). Owing to the generative mechanism of Large Language Models (LLMs), existing LLM-based CSC methods may produce corrections that deviate semantically from the original text. To address this, a multi-input CSC method based on an LLM was proposed. The method consists of two stages: multi-input candidate set construction and LLM-based correction. In the first stage, the correction results of several small models were assembled into a multi-input candidate set. In the second stage, the LLM was fine-tuned with LoRA (Low-Rank Adaptation) so that, drawing on its reasoning ability, it predicted the sentence without spelling errors from the multi-input candidate set as the final correction result. Experimental results on the public datasets SIGHAN13, SIGHAN14, SIGHAN15, and the revised SIGHAN15 show that, compared with Prompt-GEN-1, a method that generates corrections directly with an LLM, the proposed method improves the correction F1 score by 9.6, 24.9, 27.9, and 34.2 percentage points, respectively; compared with the second-best small correction model, it improves the correction F1 score by 1.0, 1.1, 0.4, and 2.4 percentage points, respectively, verifying that the proposed method improves the effect of the CSC task.
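As a rough sketch of the two-stage pipeline described above — all function names, toy correctors, and the prompt wording here are illustrative assumptions, not the paper's actual implementation — the first stage gathers candidate corrections from several small CSC models, and the second stage hands the candidate set to the fine-tuned LLM (represented here only by the prompt it would receive):

```python
# Illustrative sketch of the two-stage multi-input CSC pipeline.
# The "small models" below are toy stand-ins for trained CSC correctors,
# and the LLM selection step is represented only by the prompt it would see.

def build_candidate_set(sentence, correctors):
    """Stage 1: collect de-duplicated corrections from several small models."""
    candidates = []
    for corrector in correctors:
        fixed = corrector(sentence)
        if fixed not in candidates:
            candidates.append(fixed)
    return candidates

def build_selection_prompt(sentence, candidates):
    """Stage 2: format the multi-input candidate set as a prompt for the
    LoRA-fine-tuned LLM, which predicts the sentence without spelling errors."""
    lines = [f"Original sentence: {sentence}", "Candidate corrections:"]
    lines += [f"{i}. {cand}" for i, cand in enumerate(candidates, 1)]
    lines.append("Output the candidate that contains no spelling errors.")
    return "\n".join(lines)

# Toy correctors: one fixes the misspelled 帐号 -> 账号, one changes nothing.
fix_zhanghao = lambda s: s.replace("帐号", "账号")
identity = lambda s: s
candidates = build_candidate_set("请输入帐号", [fix_zhanghao, identity])
prompt = build_selection_prompt("请输入帐号", candidates)
```

Because the candidate set comes from several independent correctors, the LLM only has to select among grounded rewrites of the input rather than generate freely, which is what limits semantic drift from the original sentence.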

Keywords: Chinese Spelling Correction (CSC), Large Language Model (LLM), model ensemble, model fine-tuning, prompt learning

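The second stage relies on LoRA (Low-Rank Adaptation), which freezes the pretrained weight matrix W and learns only a low-rank update ΔW = (α/r)·BA. A minimal numeric sketch with toy dimensions (none of the sizes or values reflect the paper's actual configuration), assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 4, 2, 4      # toy sizes; rank r << min(d_out, d_in)

W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
A = rng.standard_normal((r, d_in))      # trainable low-rank factor
B = np.zeros((d_out, r))                # zero-initialized, so training starts
                                        # from the unmodified base model

def lora_forward(x):
    # Adapted layer: base output plus the scaled low-rank update (alpha/r)*B@A@x.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
base = W @ x
adapted = lora_forward(x)
```

Only A and B are updated during fine-tuning, so the number of trainable parameters is r·(d_in + d_out) instead of d_in·d_out, which is what makes fine-tuning a large model affordable.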

CLC number: