Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (9): 2693-2700.DOI: 10.11772/j.issn.1001-9081.2021071356

• Artificial intelligence •

Python named entity recognition model based on transformer

Guanyou XU, Weisen FENG

  1. College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
  • Received:2021-07-30 Revised:2021-11-03 Accepted:2021-11-09 Online:2022-09-19 Published:2022-09-10
  • Contact: Weisen FENG
  • About author: XU Guanyou, born in 1997, M. S. candidate. His research interests include natural language processing and knowledge graphs.

Abstract:

Recently, some character-based Named Entity Recognition (NER) models have been unable to make full use of word information, while lattice-structure models that do exploit word information may degenerate into word-based models and suffer from word segmentation errors. To address these problems, a transformer-based python NER model was proposed to encode character-word information. Firstly, each word was bound to the characters corresponding to its beginning and end. Then, three different strategies were used to encode the word information into a fixed-size representation through the transformer. Finally, a Conditional Random Field (CRF) was used for decoding, thereby avoiding the word segmentation errors introduced by obtaining word boundary information and improving batch training speed. Experimental results on the python dataset show that the F1 score of the proposed model is 2.64 percentage points higher than that of the Lattice-LSTM model, and its training time is about a quarter of that of the comparison model, indicating that the proposed model can prevent model degradation, speed up batch training, and better recognize python named entities.
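The first step described above, binding lexicon words to the characters at their beginning and end positions, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, the brute-force substring matching, and the list-of-lists representation are all assumptions made for clarity.

```python
def bind_words_to_chars(chars, lexicon):
    """Attach every lexicon word found in the character sequence to
    its begin and end character positions (hypothetical sketch of
    the character-word binding step; real systems would use a trie)."""
    begin = [[] for _ in chars]  # words starting at each character
    end = [[] for _ in chars]    # words ending at each character
    n = len(chars)
    for i in range(n):
        for j in range(i + 1, n + 1):
            word = "".join(chars[i:j])
            if word in lexicon:
                begin[i].append(word)      # word starts at position i
                end[j - 1].append(word)    # word ends at position j-1
    return begin, end


# Example: with lexicon {"for", "loop"}, the characters of "forloop"
# receive word information at the begin/end positions of each match.
begin, end = bind_words_to_chars(list("forloop"), {"for", "loop"})
print(begin[0], end[2], begin[3], end[6])
```

The per-character word lists produced here would then be encoded into a fixed-size representation by the transformer before CRF decoding.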

Key words: Named Entity Recognition (NER), word boundary, python, word information, transformer

