《计算机应用》 (Journal of Computer Applications) ›› 2022, Vol. 42 ›› Issue (9): 2680-2685. DOI: 10.11772/j.issn.1001-9081.2021071209

• Artificial Intelligence •

Chinese named entity recognition based on knowledge base entity enhanced BERT model

Jie HU1, Yan HU1, Mengchi LIU2, Yan ZHANG1

  1. School of Computer Science and Information Engineering, Hubei University, Wuhan, Hubei 430062, China
    2. School of Computer Science, South China Normal University, Guangzhou, Guangdong 510631, China
  • Received: 2021-07-12; Revised: 2021-09-18; Accepted: 2021-09-24; Online: 2021-10-08; Published: 2022-09-10
  • Contact: Jie HU
  • About author: HU Yan, born in 1993, M. S. candidate. Her research interests include natural language processing.
    LIU Mengchi, born in 1962, Ph. D., professor, CCF member. His research interests include semantic databases and deep learning.
    ZHANG Yan, born in 1974, Ph. D., professor, CCF member. His research interests include software engineering and information security.
  • Supported by:
    National Natural Science Foundation of China (61977021); Guangzhou Key Laboratory of Big Data and Intelligent Education (201905010009)

Abstract:

Aiming at the problem that the pre-training model BERT (Bidirectional Encoder Representations from Transformers) lacks vocabulary information, a Chinese named entity recognition model based on knowledge base entity enhanced BERT, called OpenKG + Entity Enhanced BERT + CRF (Conditional Random Field), was proposed on the basis of the semi-supervised entity enhanced minimum mean-square error pre-training model. Firstly, documents were downloaded from the Chinese general encyclopedia knowledge base CN-DBPedia, and entities were extracted with the Jieba Chinese word segmentation tool to expand the entity dictionary. Then, the entities in the dictionary were embedded into BERT for pre-training, and the word vectors obtained from this training were input into a Bidirectional Long Short-Term Memory (BiLSTM) network for feature extraction. Finally, the results were corrected by the CRF and output. The model was validated on the CLUENER 2020 and MSRA datasets, and compared with the Entity Enhanced BERT Pre-training, BERT+BiLSTM, ERNIE and BiLSTM+CRF models. Experimental results show that, compared with these four models, the proposed model improves the F1 score by 1.63 and 1.1 percentage points, 3.93 and 5.35 percentage points, 2.42 and 4.63 percentage points, and 6.79 and 7.55 percentage points respectively on the two datasets. It can be seen that the proposed model effectively improves the overall performance of named entity recognition, with F1 scores better than those of all comparison models.

Key words: Named Entity Recognition (NER), knowledge base, entity dictionary, pre-training model, Bidirectional Long Short-Term Memory (BiLSTM) network
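The abstract describes a concrete pipeline: build an entity dictionary from CN-DBPedia with Jieba, embed the dictionary entities into BERT, then feed the BERT vectors through a BiLSTM and a CRF. As a rough illustration of the dictionary-building step, the following Python sketch segments downloaded CN-DBPedia documents in Jieba's part-of-speech mode and keeps noun-like tokens as entity candidates. It is a hypothetical reconstruction, not the authors' code: the file names and the frequency threshold are invented for the example.

import jieba.posseg as pseg
from collections import Counter

counts = Counter()
# "cndbpedia_docs.txt" is an invented name for the downloaded CN-DBPedia documents
with open("cndbpedia_docs.txt", encoding="utf-8") as f:
    for line in f:
        for word, flag in pseg.cut(line.strip()):
            # keep noun-like tokens (POS tag starting with "n") of length >= 2
            if flag.startswith("n") and len(word) >= 2:
                counts[word] += 1

# illustrative frequency threshold; the paper does not specify one
entities = sorted(w for w, c in counts.items() if c >= 3)
with open("entity_dict.txt", "w", encoding="utf-8") as out:
    out.writelines(w + "\n" for w in entities)

The tagging model can likewise be sketched as BERT embeddings fed through a BiLSTM into a CRF layer. The sketch below is a minimal stand-in for the proposed architecture, assuming the stock bert-base-chinese checkpoint in place of the entity-enhanced BERT trained in the paper, and the third-party pytorch-crf package for the CRF layer.

import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf

class BertBiLSTMCRF(nn.Module):
    def __init__(self, num_tags, lstm_hidden=256, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        # contextual character vectors from BERT
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        # BiLSTM extracts sequential features; a linear layer maps them to tag scores
        feats, _ = self.bilstm(hidden)
        emissions = self.emission(feats)
        mask = attention_mask.bool()
        if tags is not None:
            # training: negative log-likelihood under the CRF
            return -self.crf(emissions, tags, mask=mask)
        # inference: Viterbi decoding returns the best tag path per sentence,
        # i.e. the "corrected by CRF" step in the abstract
        return self.crf.decode(emissions, mask=mask)

In use, model(input_ids, attention_mask, tags) would give a loss to minimize during training, and model(input_ids, attention_mask) would return the decoded tag sequences at inference time.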
