Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (3): 702-708.DOI: 10.11772/j.issn.1001-9081.2023030361

• Artificial intelligence •

Chinese named entity recognition combining prior knowledge and glyph features

Yongfeng DONG1,2,3, Jiaming BAI1, Liqin WANG1,2,3(), Xu WANG1,2,3   

  1. School of Artificial Intelligence and Data Science, Hebei University of Technology, Tianjin 300401, China
    2. Hebei Province Key Laboratory of Big Data Computing (Hebei University of Technology), Tianjin 300401, China
    3. Hebei Data-Driven Industrial Intelligent Engineering Research Center (Hebei University of Technology), Tianjin 300401, China
  • Received:2023-04-04 Revised:2023-05-08 Accepted:2023-05-18 Online:2023-05-30 Published:2024-03-10
  • Contact: Liqin WANG
  • About author: DONG Yongfeng, born in 1977, Ph.D., professor, CCF senior member. His research interests include artificial intelligence and knowledge graphs.
    BAI Jiaming, born in 1998, M.S. candidate. His research interests include natural language processing and knowledge graphs.
    WANG Xu, born in 1995, Ph.D., lecturer. His research interests include natural language processing and knowledge graphs.
  • Supported by:
    Science and Technology Research Program of Higher Education Institutions in Hebei Province(ZD2022082);Higher Education Teaching Reform Research and Practice Project of Hebei Province(2022GJJG049);Postgraduate Education Teaching Reform Research Project of Hebei Province(YJG2023023)



To address the problem that related models typically model only characters and associated vocabulary, without fully exploiting the glyph structure information unique to Chinese characters or the entity type information, a model integrating prior knowledge and glyph features was proposed for the Named Entity Recognition (NER) task. First, the input sequence was encoded by a Transformer combined with a Gaussian attention mechanism, and the Chinese definitions of entity types were obtained from Chinese Wikipedia; a Bidirectional Gated Recurrent Unit (BiGRU) encoded this entity type information as prior knowledge, which was combined with the character representations through an attention mechanism. Second, a Bidirectional Long Short-Term Memory (BiLSTM) network encoded the long-distance dependencies of the input sequence, and a glyph encoding table provided the Cangjie codes of traditional Chinese characters and the modern Wubi codes of simplified Chinese characters; a Convolutional Neural Network (CNN) then extracted glyph feature representations, the traditional and simplified glyph features were combined with different weights, and the result was fused with the BiLSTM-encoded character representations through a gating mechanism. Finally, a Conditional Random Field (CRF) decoded the sequence to obtain the named entity labels. Experimental results on the colloquial dataset Weibo, the small dataset Boson, and the large dataset PeopleDaily show that, compared with the baseline model MECT (Multi-metadata Embedding based Cross-Transformer), the proposed model improves the F1 score by 2.47, 1.20, and 0.98 percentage points, respectively, validating its effectiveness.
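As a rough illustration only (not the authors' implementation), the weighted combination of traditional and simplified glyph features and the subsequent gating fusion with the BiLSTM character representations can be sketched in NumPy as below. All names, dimensions, the fixed mixing weight `alpha`, and the randomly initialized gate parameters are hypothetical placeholders; in the actual model these would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical dimensions: a sequence of 4 characters, hidden size 8.
seq_len, hidden = 4, 8

h_char = rng.standard_normal((seq_len, hidden))  # BiLSTM character representations
f_trad = rng.standard_normal((seq_len, hidden))  # CNN features from Cangjie codes (traditional)
f_simp = rng.standard_normal((seq_len, hidden))  # CNN features from Wubi codes (simplified)

# Weighted combination of traditional and simplified glyph features.
# The 0.5/0.5 split is an arbitrary placeholder, not the paper's weights.
alpha = 0.5
f_glyph = alpha * f_trad + (1.0 - alpha) * f_simp

# Gating mechanism: a gate computed from both representations decides,
# per dimension, how much glyph information to mix into each character.
W_g = rng.standard_normal((2 * hidden, hidden))
b_g = np.zeros(hidden)
gate = sigmoid(np.concatenate([h_char, f_glyph], axis=-1) @ W_g + b_g)
fused = gate * h_char + (1.0 - gate) * f_glyph

print(fused.shape)  # (4, 8)
```

The fused representations would then be passed to the CRF layer for decoding; the sigmoid gate keeps the mixture a convex combination of the two inputs in every dimension.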

Key words: Named Entity Recognition (NER), attention mechanism, Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), Conditional Random Field (CRF)


