Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (2): 377-384. DOI: 10.11772/j.issn.1001-9081.2023020239

• Artificial Intelligence •

Entity category enhanced nested named entity recognition in automotive domain

Ziqi HUANG, Jianpeng HU

  1. School of Electric and Electronic Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
  • Received: 2023-03-06 Revised: 2023-05-16 Accepted: 2023-05-22 Online: 2023-08-14 Published: 2024-02-10
  • Contact: Jianpeng HU
  • About author: HUANG Ziqi, born in 1997 in Ganzhou, Jiangxi, M. S. candidate, CCF student member. His research interests include natural language processing;
  • Supported by:
    Science and Technology Innovation 2030 - Major Project of "New Generation Artificial Intelligence" (2020AAA0109300)

Abstract:

To address the poor recognition of nested entities and long entities in Chinese automotive domain entity extraction, an Entity Category Enhanced nested Named Entity Recognition (ECE-NER) model was proposed. Firstly, feature fusion encoding was used to improve the model's perception of domain entity boundaries. Then, the tail word recognition module used a multi-layer perceptron to obtain the set of entity tail words. Finally, the forward boundary recognition module used entity category features constructed from sememes, together with a self-attention mechanism, to obtain entity category-enhanced representations of the candidate tail words; fusing the domain entity category features, a biaffine encoder then calculated the entity span probability for each specific tail word and entity type, thereby determining the named entities. The model was evaluated on a production line fault dataset from an automotive enterprise, the automotive industry fault extraction evaluation dataset CCL2022, and the Chinese medical text dataset CHIP2020. Experimental results show that, on the first two datasets, the proposed model improves the entity recognition F1 value by 4.1, 1.8, 1.6 percentage points and 9.0, 5.4, 7.3 percentage points respectively compared with the sequence labeling model (BERT+BiLSTM+CRF) and the span-based entity extraction models (PURE (Princeton University Relation Extraction) and SpERT (Span-based Entity and Relation Transformer)). On the first and third datasets, the nested entity recognition F1 value is improved by 13.3, 8.3 percentage points and 21.7, 9.3 percentage points compared with the PURE and SpERT models, which verifies the effectiveness of the proposed model for nested entity recognition.

Key words: feature fusion, sememe feature, self-attention mechanism, biaffine encoder, Chinese nested named entity recognition
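
The last step described in the abstract, scoring candidate spans that end at a given tail word with a biaffine function, can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring form `h_head^T U h_tail + w · [h_head; h_tail] + b`, the use of a per-span sigmoid with a 0.5 threshold, and all names (`biaffine_score`, `spans_for_tail`, the toy vectors and parameters) are assumptions chosen for illustration, with one parameter set standing in for a single entity type.

```python
import math

def biaffine_score(h_head, h_tail, U, w, b):
    """Assumed biaffine form: h_head^T U h_tail + w . [h_head; h_tail] + b."""
    bilinear = sum(
        h_head[i] * U[i][j] * h_tail[j]
        for i in range(len(h_head))
        for j in range(len(h_tail))
    )
    linear = sum(wk * xk for wk, xk in zip(w, h_head + h_tail))
    return bilinear + linear + b

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def spans_for_tail(tokens, tail, U, w, b, threshold=0.5):
    """Return (head, tail, prob) for every head <= tail whose probability exceeds threshold."""
    spans = []
    for head in range(tail + 1):
        p = sigmoid(biaffine_score(tokens[head], tokens[tail], U, w, b))
        if p > threshold:
            spans.append((head, tail, p))
    return spans

# Toy 2-d token vectors and hand-picked parameters (hypothetical, one entity type).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
U = [[1.0, 0.0], [0.0, -1.0]]
w = [0.0, 0.0, 0.0, 0.0]
b = 0.5

result = spans_for_tail(tokens, tail=2, U=U, w=w, b=b)
for head, tail, prob in result:
    print(head, tail, round(prob, 3))
```

Because each head position is scored independently for a fixed tail word, several spans sharing the same tail can be accepted at once, which is what allows nested entities to be returned. In the toy input both the span from token 0 to token 2 and the single-token span at token 2 pass the threshold.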

CLC Number: