Journal of Computer Applications (the only official website)



YU Youren, ZHANG Yangsen, JIANG Yuru, HUANG Gaijuan

  1. Beijing Information Science and Technology University
  • Corresponding author: YU Youren
  • Supported by:
    Research on Intelligent Recommendation Methods for Science and Technology Review Experts Based on Semantic Analysis

Chinese named entity recognition model incorporating multi-granularity linguistic knowledge and hierarchical information

  • Received: 2023-06-28  Revised: 2023-07-27  Online: 2023-09-04  Published: 2023-09-04



Abstract: To address the problems that most current named entity recognition models encode only character-level information and lack extraction of hierarchical text information, a Chinese named entity recognition model incorporating Multi-granularity linguistic knowledge and Hierarchical information (CMH) was proposed. First, the text was encoded by a model pre-trained with multi-granularity linguistic knowledge, so that both fine-grained and coarse-grained linguistic information of the text could be captured and the corpus better represented. Second, hierarchical information was extracted by the ON-LSTM (Ordered Neurons Long Short-Term Memory network) model, so as to exploit the hierarchical structure of the text itself and strengthen the temporal relationships between encodings. Finally, word segmentation information was incorporated at the decoding end of the model, and the entity recognition problem was transformed into a table-filling problem, which better resolves entity overlap and yields more accurate recognition results. Meanwhile, to address the poor transferability of current models across domains, the concept of universal entity recognition was proposed, and a universal named entity recognition dataset (MDNER) was constructed by screening universal entity types from multiple domains, so as to enhance the model's generalization ability across domains. To validate the effectiveness of the proposed model, experiments were conducted on the Resume, Weibo, and MSRA datasets; compared with the MECT (Multi-metadata Embedding based Cross-Transformer) model, the F1 values were improved by 0.94, 4.95, and 1.58 percentage points respectively, reaching the best performance. To further verify the model's entity recognition performance across domains, experiments were conducted on MDNER, where the F1 value reached 95.29%. The experimental results show that multi-granularity linguistic knowledge pre-training, extraction of hierarchical text structure information, and the efficient pointer decoder are all crucial to the model's performance.
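The ON-LSTM mentioned in the abstract induces a hierarchy over hidden units through its ordered-neurons mechanism, whose key ingredient is the cumax (cumulative softmax) activation: a softmax followed by a cumulative sum, yielding a monotonically non-decreasing gate. The following plain-Python sketch is an illustration of cumax only, not code from the paper, and real implementations operate on tensors rather than scalar lists:

```python
import math

def cumax(logits):
    """cumax(x) = cumsum(softmax(x)): the activation ON-LSTM uses to build
    monotone master gates, so that low-ranked units are forgotten before
    high-ranked ones. Toy scalar version for illustration."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    out, running = [], 0.0
    for e in exps:
        running += e / total  # cumulative sum of softmax probabilities
        out.append(running)
    return out

gate = cumax([0.2, 1.0, -0.5])  # non-decreasing values ending at 1.0
```

Because the gate rises monotonically from 0 toward 1, the point at which it switches on acts as a soft "hierarchy level" per unit, which is how ON-LSTM encodes the text's hierarchical structure.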
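The table-filling formulation of decoding described in the abstract can be illustrated with a toy span decoder. Everything below (the hand-made scores, the function name, the 0.5 threshold) is an illustrative assumption rather than the paper's efficient-pointer implementation: the idea is that every (start, end) cell of a span table gets a score per entity type, and each cell above threshold is emitted independently, which is why nested or overlapping entities can coexist:

```python
def decode_table(scores, threshold=0.5):
    """scores: dict mapping (start, end, label) -> float, with start <= end
    (character indices, inclusive). Returns all spans above the threshold;
    spans are decoded independently, so overlapping entities can coexist."""
    return sorted(
        (start, end, label)
        for (start, end, label), s in scores.items()
        if start <= end and s > threshold
    )

# Toy example over "北京信息科技大学": the whole span is an ORG and the
# nested span "北京" is a LOC; both survive decoding. Scores are invented.
toy_scores = {
    (0, 7, "ORG"): 0.93,   # whole university name
    (0, 1, "LOC"): 0.71,   # nested city name
    (2, 5, "ORG"): 0.12,   # low-scoring cell, dropped
}
entities = decode_table(toy_scores)
```

A sequence-labeling decoder would have to pick one tag per character and thus lose one of the two overlapping entities; scoring table cells sidesteps that, which matches the abstract's motivation for the table-filling reformulation.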

Key words: named entity recognition, natural language processing, knowledge graph construction, efficient pointer, universal entities
