Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (12): 3771-3778.DOI: 10.11772/j.issn.1001-9081.2024111694

• Artificial intelligence •

Chinese semantic error recognition model based on hierarchical information enhancement

Yuqi ZHANG1,2,3,4, Ying SHA1,2,3,4   

  1. College of Informatics, Huazhong Agricultural University, Wuhan, Hubei 430070, China
    2. Key Laboratory of Smart Breeding Technology, Ministry of Agriculture and Rural Affairs (Huazhong Agricultural University), Wuhan, Hubei 430070, China
    3. Hubei Engineering Technology Research Center of Agricultural Big Data (Huazhong Agricultural University), Wuhan, Hubei 430070, China
    4. Engineering Research Center of Agricultural Intelligent Technology, Ministry of Education (Huazhong Agricultural University), Wuhan, Hubei 430070, China
  • Received:2024-12-02 Revised:2025-04-16 Accepted:2025-04-23 Online:2025-04-27 Published:2025-12-10
  • Contact: Ying SHA
• About author:ZHANG Yuqi, born in 1998 in Taiyuan, Shanxi, M. S. candidate. His research interests include natural language processing.
    SHA Ying, born in 1973 in Yangzhou, Jiangsu, Ph. D., professor, senior member of CCF. His research interests include natural language processing, machine learning, and artificial intelligence.
  • Supported by:
    National Natural Science Foundation of China(62272188)


Abstract:

Semantic errors in Chinese differ from simple spelling and grammatical errors in that they are more inconspicuous and complex. Chinese Semantic Error Recognition (CSER) aims to determine whether a Chinese sentence contains a semantic error. As a prerequisite task for semantic proofreading, the performance of the recognition model is crucial to semantic error correction. To address the problem that CSER models ignore the differences between syntactic structure and contextual structure when integrating syntactic information, a Hierarchical Information Enhancement Graph Convolutional Network (HIE-GCN) model was proposed, which embeds the hierarchical information of syntactic-tree nodes into the context encoder, thereby narrowing the gap between syntactic structure and contextual structure. Firstly, a traversal algorithm was used to extract the hierarchical information of the nodes in the syntactic tree. Secondly, this hierarchical information was embedded into the BERT (Bidirectional Encoder Representations from Transformers) model to generate character features; a Graph Convolutional Network (GCN) then used these character features as the node features of the graph, and the feature vector of the whole sentence was obtained after graph convolution. Finally, a fully connected layer performed one-class (error vs. no-error) or multi-class semantic error recognition. Results of semantic error recognition and correction experiments conducted on the FCGEC (Fine-grained Corpus for Chinese Grammatical Error Correction) and NaCGEC (Native Chinese Grammatical Error Correction) datasets show that, on the FCGEC dataset, compared with the baseline models, HIE-GCN improves accuracy by at least 0.10 percentage points and F1 score by at least 0.13 percentage points in one-class error recognition, and improves accuracy by at least 1.05 percentage points and F1 score by at least 0.53 percentage points in multi-class error recognition.
Ablation experimental results verify the effectiveness of the hierarchical information embedding. Compared with Large Language Models (LLMs) such as GPT and Qwen, the proposed model achieves markedly higher overall recognition performance. In the correction experiments, a recognition-then-correction two-stage pipeline improves correction precision by 8.01 percentage points over a sequence-to-sequence direct correction model. It is also found that, when correcting with the LLM GLM4, hinting the model about the sentence's error type raises correction precision by 4.62 percentage points.
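The two core steps described above can be illustrated with a minimal sketch: extracting each node's hierarchical level (its depth in the dependency tree) via a breadth-first traversal, and running one graph-convolution step over per-token features. This is not the paper's implementation; the head-array encoding, the feature dimensions, and all function names are illustrative assumptions.

```python
# Hedged sketch: (1) depth of each token in a dependency tree via BFS,
# (2) one GCN propagation step over the tree's adjacency matrix.
# `heads[i]` is the head index of token i; the root uses -1.
from collections import deque
import numpy as np

def node_depths(heads):
    """Hierarchical level of each node in the dependency tree (root = 0)."""
    n = len(heads)
    children = [[] for _ in range(n)]
    root = -1
    for i, h in enumerate(heads):
        if h == -1:
            root = i
        else:
            children[h].append(i)
    depths = [0] * n
    queue = deque([root])
    while queue:                      # breadth-first traversal
        u = queue.popleft()
        for v in children[u]:
            depths[v] = depths[u] + 1
            queue.append(v)
    return depths

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)

# Toy tree: token 1 is root; tokens 0 and 2 depend on it; 3 depends on 2.
heads = [1, -1, 1, 2]
print(node_depths(heads))  # [1, 0, 1, 2]

# Symmetric adjacency from the dependency arcs; H stands in for the
# hierarchy-enhanced BERT character features.
n = len(heads)
A = np.zeros((n, n))
for i, h in enumerate(heads):
    if h != -1:
        A[i, h] = A[h, i] = 1.0
H = np.random.rand(n, 8)
W = np.random.rand(8, 4)
print(gcn_layer(A, H, W).shape)  # (4, 4)
```

In a full model, the depth values would index an embedding table added to BERT's token embeddings, and the sentence vector would be pooled from the GCN output before the fully connected classifier.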

Key words: Natural Language Processing (NLP), Graph Convolutional Network (GCN), Chinese Semantic Error Recognition (CSER), Large Language Model (LLM), dependency parsing

