Journal of Computer Applications, 2025, Vol. 45, Issue 5: 1528-1534. DOI: 10.11772/j.issn.1001-9081.2024050628

• Artificial Intelligence •

基于多模态信息融合的中文拼写纠错算法

ZHANG Qing (张庆)1,2, YANG Fan (杨凡)1,2, FANG Yuhan (方宇涵)1,2

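  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu 610213
    2. University of Chinese Academy of Sciences, Beijing 100049
  • Received: 2024-05-17; Revised: 2024-12-10; Accepted: 2024-12-26; Online: 2025-01-03; Published: 2025-05-10
  • Corresponding author: YANG Fan
  • About the authors: ZHANG Qing (born in 2000), male, from Yangquan, Shanxi, M.S. candidate. His research interests include natural language processing, big data analytics, and machine learning.
    YANG Fan (born in 1978), male, from Danyang, Jiangsu, senior engineer, Ph.D. His research interests include big data, artificial intelligence, and industrial software.
    FANG Yuhan (born in 2000), male, from Ziyang, Sichuan, M.S. candidate. His research interests include deep learning and natural language processing.
  • Funding:
    Science and Technology Program of Sichuan Province (24QYCX0229); Key Research and Development Support Program of Chengdu (2023-YF11-00092-HZ)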

Chinese spelling correction algorithm based on multi-modal information fusion

Qing ZHANG1,2, Fan YANG1,2, Yuhan FANG1,2

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, Sichuan 610213, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2024-05-17; Revised: 2024-12-10; Accepted: 2024-12-26; Online: 2025-01-03; Published: 2025-05-10
  • Contact: Fan YANG
  • About the authors: ZHANG Qing, born in 2000, M.S. candidate. His research interests include natural language processing, big data analytics, and machine learning.
    YANG Fan, born in 1978, Ph.D., senior engineer. His research interests include big data, artificial intelligence, and industrial software.
    FANG Yuhan, born in 2000, M.S. candidate. His research interests include deep learning and natural language processing.
  • Supported by:
    Science and Technology Program of Sichuan Province (24QYCX0229); Key Research and Development Support Program of Chengdu (2023-YF11-00092-HZ)

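Abstract:

Chinese Spelling Correction (CSC) aims to detect and correct character-level or word-level errors in user-entered Chinese text; such errors usually stem from misuse caused by semantic, phonetic, or glyphic similarity between Chinese characters. However, existing models usually ignore local information, cannot fully capture the phonetic and glyphic similarity between different characters, and cannot effectively combine this information with semantic information. To address these problems, a CSC algorithm based on multimodal information fusion, PWSpell, is proposed. The algorithm uses a convolutional attention mechanism to attend to local semantic information, uses Pinyin encoding to capture phonetic similarity between characters, and introduces Wubi encoding into the CSC field for the first time to capture glyphic similarity between characters. In addition, these two similarity relations are selectively fused with semantic information processed by BERT (Bidirectional Encoder Representation from Transformers). Experimental results show that, on the SIGHAN 2015 test set, PWSpell improves detection-level accuracy, precision, and F1 score as well as correction-level precision and F1 score, with correction-level precision rising by at least 1 percentage point; ablation results also confirm that each module in the algorithm effectively improves model performance.

Keywords: Chinese natural language processing, Chinese spelling correction, BERT, multimodal information fusion, local information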
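
As a rough illustration of why Pinyin and Wubi codes are useful signals for spelling correction, the sketch below compares characters by the similarity of their code strings. It is not the PWSpell encoder, which the abstract does not detail. The lookup tables, the function names, and the use of `difflib` string similarity are all assumptions made for illustration, and the Wubi 86 codes in particular should be verified against an authoritative table.

```python
from difflib import SequenceMatcher

# Toy lookup tables (assumptions for illustration only).
# Pinyin is written without tones; the Wubi 86 codes were entered by hand
# and should be checked against a real Wubi table before any serious use.
PINYIN = {"清": "qing", "情": "qing", "晴": "qing", "睛": "jing", "人": "ren"}
WUBI = {"清": "igeg", "情": "ngeg", "晴": "jgeg", "睛": "hgeg", "人": "wwww"}


def code_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] between two code strings."""
    return SequenceMatcher(None, a, b).ratio()


def phonetic_similarity(x: str, y: str) -> float:
    """Similarity of two characters based on their Pinyin strings."""
    return code_similarity(PINYIN[x], PINYIN[y])


def glyph_similarity(x: str, y: str) -> float:
    """Similarity of two characters based on their Wubi code strings."""
    return code_similarity(WUBI[x], WUBI[y])


if __name__ == "__main__":
    for x, y in [("清", "情"), ("晴", "睛"), ("清", "人")]:
        print(x, y, phonetic_similarity(x, y), glyph_similarity(x, y))
```

In this toy table, 晴 and 睛 differ only in the first Wubi key, reflecting their shared right-hand component 青, so their glyph score is high even though their Pinyin differs. A neural model would learn such relations from the code sequences rather than use a fixed string metric.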

Abstract:

The goal of Chinese Spelling Correction (CSC) is to detect and correct character-level or word-level errors in user-input Chinese text, which commonly arise from semantic, phonetic, or glyphic similarities among Chinese characters. However, existing models often neglect local information, fail to fully capture phonetic and glyphic similarities among different Chinese characters, and do not effectively integrate these similarities with semantic information. To address these issues, a new CSC algorithm based on multimodal information fusion, named PWSpell, is proposed. The algorithm uses a convolutional attention mechanism to focus on local semantic information, employs Pinyin encoding to capture phonetic similarities among characters, and, for the first time, introduces Wubi encoding into the CSC domain to capture glyphic similarities among Chinese characters. Additionally, it selectively integrates these two types of similarity information with semantic information processed by BERT (Bidirectional Encoder Representation from Transformers). Experimental results demonstrate that, on the SIGHAN 2015 test set, PWSpell improves detection-level accuracy, precision, and F1-score as well as correction-level precision and F1-score, with an increase of at least one percentage point in correction-level precision. Ablation results also confirm that each module in PWSpell effectively improves its performance.

Key words: Chinese natural language processing, Chinese Spelling Correction (CSC), BERT (Bidirectional Encoder Representation from Transformers), multimodal information fusion, local information
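
The abstract says that the phonetic and glyphic similarity information is fused selectively with the BERT semantics, but it does not describe the mechanism. One common way to realize such selectivity is a learned sigmoid gate per auxiliary modality; the minimal sketch below assumes that design for a single character position. The function `gated_fusion`, its weight vectors, and the additive combination are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(u: List[float], v: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gated_fusion(
    semantic: List[float],
    phonetic: List[float],
    glyph: List[float],
    w_p: List[float],
    w_g: List[float],
    b_p: float = 0.0,
    b_g: float = 0.0,
) -> Tuple[List[float], float, float]:
    """Fuse three feature vectors of one character (assumed gating scheme).

    Each gate is a scalar in (0, 1) computed from the semantic vector
    concatenated with one auxiliary vector, so w_p and w_g must have
    twice the feature dimension.
    """
    if not (len(semantic) == len(phonetic) == len(glyph)):
        raise ValueError("all feature vectors must have the same length")
    g_p = sigmoid(dot(w_p, semantic + phonetic) + b_p)  # phonetic gate
    g_g = sigmoid(dot(w_g, semantic + glyph) + b_g)     # glyph gate
    fused = [s + g_p * p + g_g * g for s, p, g in zip(semantic, phonetic, glyph)]
    return fused, g_p, g_g


if __name__ == "__main__":
    # Zero weights give neutral gates of 0.5 for both modalities.
    print(gated_fusion([1.0, 2.0], [2.0, 0.0], [0.0, 4.0], [0.0] * 4, [0.0] * 4))
```

With strongly negative biases both gates close and the output falls back to the semantic vector alone, which is the sense in which such a fusion is selective.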

CLC number: