Journal of Computer Applications
张伟1,牛家祥2,马继超1,沈琼霞3
Abstract: ReLM (Rephrasing Language Model), currently a leading Chinese Spelling Correction (CSC) model, represents features inadequately in complex semantic scenarios. To address this issue, FeReLM (Feature-enhanced Rephrasing Language Model) is proposed. The model uses Depthwise Separable Convolution (DSC) to fuse the deep semantic features generated by the feature extraction model BGE (BAAI General Embedding) with the global features produced by ReLM, which improves its ability to parse complex contexts and its precision in detecting and correcting spelling errors. FeReLM is first trained on the Wang271K dataset so that it learns the deep semantics and complex expressions within sentences; the learned weights are then transferred to new datasets and fine-tuned. Experimental results show that FeReLM outperforms ReLM, MCRSpell (Metric learning of Correct Representation for Chinese Spelling Correction), RSpell (Retrieval-augmented Framework for Domain Adaptive Chinese Spelling Check), and other models on key metrics such as precision, recall, and F1 score, with improvements ranging from 0.6 to 28.7 percentage points. Ablation experiments confirm the effectiveness of the proposed method.
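To make the fusion step concrete, the following is a minimal pure-Python sketch of a depthwise separable convolution applied to two feature sequences. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the BGE and ReLM features are aligned or combined before the convolution, so the sketch assumes both are already token-aligned and simply concatenates them along the channel axis. The function names (`depthwise_conv1d`, `pointwise_conv1d`, `fuse_features`) and all kernel values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # indexed as [position][channel]


def depthwise_conv1d(x: Matrix, kernels: Matrix) -> Matrix:
    """Filter each channel with its own kernel (zero padding, same length).

    kernels[c] is the odd-length kernel for channel c. As in most deep
    learning libraries, this is cross-correlation (the kernel is not flipped).
    """
    seq_len, channels = len(x), len(x[0])
    pad = len(kernels[0]) // 2
    out = [[0.0] * channels for _ in range(seq_len)]
    for c in range(channels):
        for t in range(seq_len):
            acc = 0.0
            for j, w in enumerate(kernels[c]):
                src = t + j - pad
                if 0 <= src < seq_len:  # positions outside the sequence count as zero
                    acc += w * x[src][c]
            out[t][c] = acc
    return out


def pointwise_conv1d(x: Matrix, weights: Matrix) -> Matrix:
    """1x1 convolution: mix channels at every position.

    weights[o][c] is the weight from input channel c to output channel o.
    """
    return [
        [sum(w * v for w, v in zip(w_row, x_row)) for w_row in weights]
        for x_row in x
    ]


def fuse_features(relm: Matrix, bge: Matrix,
                  dw_kernels: Matrix, pw_weights: Matrix) -> Matrix:
    """Assumed fusion: concatenate channels, then depthwise + pointwise conv."""
    stacked = [r + b for r, b in zip(relm, bge)]  # channel-wise concatenation
    return pointwise_conv1d(depthwise_conv1d(stacked, dw_kernels), pw_weights)


# Toy inputs: 3 token positions, 2 "ReLM" channels and 1 "BGE" channel.
relm_feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
bge_feats = [[0.5], [1.5], [2.5]]
identity_kernels = [[0.0, 1.0, 0.0]] * 3       # depthwise step leaves values unchanged
mix_weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # output 0 adds channels 0 and 2
fused = fuse_features(relm_feats, bge_feats, identity_kernels, mix_weights)
```

The depthwise step filters each channel independently along the sequence, and the pointwise step is the only place where the two feature sources are mixed. For `k` input channels, `m` output channels, and kernel size `s`, this uses `k*s + k*m` weights instead of the `k*m*s` weights of a standard convolution.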
Key words: Natural Language Processing (NLP), feature enhancement, Chinese Spelling Correction (CSC)
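Key words (Chinese-language record): natural language processing, feature enhancement, Chinese spelling correction, semantic fusion, text error correction, pre-trained language model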
CLC Number: TP391.1
张伟, 牛家祥, 马继超, 沈琼霞. Deep semantic feature-enhanced ReLM model for Chinese spelling correction [J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2024071015.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024071015