Chinese named entity recognition model incorporating multi-granularity linguistic knowledge and hierarchical information

doi:10.11772/j.issn.1001-9081.2023060833

Journal of Computer Applications (official website)

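YU Youren (于右任), ZHANG Yangsen (张仰森), JIANG Yuru (蒋玉茹), HUANG Gaijuan (黄改娟)

Beijing Information Science and Technology University

Received: 2023-06-28 Revised: 2023-07-27 Online: 2023-09-04 Published: 2023-09-04
Corresponding author: YU Youren
Funding: Research on intelligent recommendation methods for science and technology review experts based on semantic analysis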

Abstract: To address the problem that most current named entity recognition models encode only character-level information and do not extract hierarchical information from the text, a Chinese named entity recognition model incorporating Multi-granularity linguistic knowledge and Hierarchical information (CMH) was proposed. First, the text was encoded with a model pre-trained on multi-granularity linguistic knowledge, so that both fine-grained and coarse-grained linguistic information could be captured and the corpus could be better represented. Second, the ON-LSTM (Ordered Neurons Long Short-Term Memory network) model was used to extract hierarchical information, exploiting the hierarchical structure of the text itself and strengthening the sequential relationships between encodings. Finally, word segmentation information of the text was incorporated at the decoding end of the model, and entity recognition was recast as a table-filling problem, in order to better handle overlapping entities and obtain more accurate recognition results. In addition, to address the poor transferability of current models across domains, the concept of universal entity recognition was proposed, and a universal named entity recognition dataset (MDNER) was constructed by selecting entity types common to multiple domains, with the aim of improving the generalization ability of models across domains. To validate the proposed model, experiments were conducted on the Resume, Weibo, and MSRA datasets; compared with the MECT (Multi-metadata Embedding based Cross-Transformer) model, the F1 values improved by 0.94, 4.95, and 1.58 percentage points respectively, reaching the optimal level. To verify the model's entity recognition performance across multiple domains, experiments were also conducted on the MDNER dataset, where the F1 value reached 95.29%. The experimental results show that pre-training with multi-granularity linguistic knowledge, extraction of the text's hierarchical structure information, and the efficient pointer decoder are crucial to the model's performance.
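
The hierarchical information extraction step relies on ON-LSTM, whose defining operation is the cumax activation (a cumulative sum over a softmax) used to build "master" forget and input gates. The sketch below follows the general ON-LSTM formulation rather than the CMH implementation, which this page does not give; the function names `softmax`, `cumax`, and `on_lstm_cell_update`, and the use of plain Python lists as vectors, are illustrative assumptions.

```python
import math
from itertools import accumulate


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cumax(logits):
    """cumax(x) = cumsum(softmax(x)): non-decreasing values that end at 1."""
    return list(accumulate(softmax(logits)))


def on_lstm_cell_update(c_prev, c_hat, f, i, f_master_logits, i_master_logits):
    """One ordered-neurons cell-state update (all arguments are equal-length lists).

    c_prev, c_hat   : previous cell state and candidate cell state
    f, i            : ordinary LSTM forget/input gate values in [0, 1]
    *_master_logits : pre-activation scores for the two master gates
    """
    f_master = cumax(f_master_logits)                      # non-decreasing
    i_master = [1.0 - g for g in cumax(i_master_logits)]   # non-increasing
    c_new = []
    for k in range(len(c_prev)):
        overlap = f_master[k] * i_master[k]
        # Inside the overlap the ordinary gates decide; outside it the
        # master gates alone decide what is kept and what is written.
        f_k = f[k] * overlap + (f_master[k] - overlap)
        i_k = i[k] * overlap + (i_master[k] - overlap)
        c_new.append(f_k * c_prev[k] + i_k * c_hat[k])
    return c_new
```

When the master-gate scores put essentially all softmax mass on the first position, the master forget gate is fully open and the master input gate fully closed, so the previous cell state passes through unchanged regardless of the ordinary gates.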

Key words: named entity recognition, natural language processing, knowledge graph construction, efficient pointer, universal entity
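
Recasting entity recognition as table filling means scoring every candidate span (start, end) for each entity type and reading entities off the cells whose score is high enough. The sketch below shows only that read-out step on hand-written scores. The score values, the threshold of 0, and the names `decode_table` and `empty_table` are assumptions for illustration, and the word segmentation features that CMH adds at the decoding end are not modeled.

```python
from typing import Dict, List, Tuple

ScoreTable = List[List[float]]  # table[start][end] = score of span [start, end]


def empty_table(n: int, fill: float = -1.0) -> ScoreTable:
    """Build an n x n table where every span starts with a negative score."""
    return [[fill] * n for _ in range(n)]


def decode_table(
    text: str,
    tables: Dict[str, ScoreTable],
    threshold: float = 0.0,
) -> List[Tuple[str, int, int, str]]:
    """Return (type, start, end, surface) for every cell scoring above threshold.

    Only the upper triangle (start <= end) is read, because a span cannot end
    before it starts. Each cell is judged independently, so nested or
    overlapping spans can all be returned.
    """
    entities = []
    n = len(text)
    for label, table in tables.items():
        for start in range(n):
            for end in range(start, n):
                if table[start][end] > threshold:
                    entities.append((label, start, end, text[start:end + 1]))
    # Sort by position so the output order does not depend on dict order.
    return sorted(entities, key=lambda e: (e[1], e[2], e[0]))


if __name__ == "__main__":
    sentence = "北京大学"
    org = empty_table(len(sentence))
    org[0][3] = 5.0   # hypothetical score: whole string is an organization
    loc = empty_table(len(sentence))
    loc[0][1] = 3.2   # hypothetical score: first two characters are a location
    for entity in decode_table(sentence, {"ORG": org, "LOC": loc}):
        print(entity)
```

For the text 北京大学 with a high ORG score at (0, 3) and a high LOC score at (0, 1), both the nested location 北京 and the organization 北京大学 are returned, which a single tag per character could not express.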

CLC number: TP391

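YU Youren, ZHANG Yangsen, JIANG Yuru, HUANG Gaijuan. Chinese named entity recognition model incorporating multi-granularity linguistic knowledge and hierarchical information[J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2023060833.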
