《计算机应用》 (Journal of Computer Applications) ›› 2022, Vol. 42 ›› Issue (9): 2680-2685. DOI: 10.11772/j.issn.1001-9081.2021071209

• Artificial Intelligence •

Chinese named entity recognition based on knowledge base entity enhanced BERT model

Jie HU1, Yan HU1, Mengchi LIU2, Yan ZHANG1

  1. School of Computer Science and Information Engineering, Hubei University, Wuhan, Hubei 430062, China
    2. School of Computer Science, South China Normal University, Guangzhou, Guangdong 510631, China
  • Received: 2021-07-12; Revised: 2021-09-18; Accepted: 2021-09-24; Online: 2021-10-08; Published: 2022-09-10
  • Contact: Jie HU
  • About author: HU Yan, born in 1993, M. S. candidate. Her research interests include natural language processing.
    LIU Mengchi, born in 1962, Ph. D., professor, CCF member. His research interests include semantic databases and deep learning.
    ZHANG Yan, born in 1974, Ph. D., professor, CCF member. His research interests include software engineering and information security.
  • Supported by:
    National Natural Science Foundation of China (61977021); Guangzhou Key Laboratory of Big Data and Intelligent Education (201905010009)

Abstract:

Aiming at the problem that the pre-training model BERT (Bidirectional Encoder Representations from Transformers) lacks vocabulary information, a Chinese named entity recognition model based on knowledge base entity enhanced BERT, called OpenKG + Entity Enhanced BERT + CRF (Conditional Random Field), was proposed on the basis of the semi-supervised entity enhanced minimum mean-square error pre-training model. Firstly, documents were downloaded from the Chinese general encyclopedia knowledge base CN-DBPedia, and entities were extracted with the Jieba Chinese word segmentation tool to expand the entity dictionary. Then, the entities in the dictionary were embedded into BERT for pre-training, and the word vectors obtained from this training were input into a Bidirectional Long Short-Term Memory (BiLSTM) network for feature extraction. Finally, the results were corrected by the CRF and output. The model was validated on the CLUENER 2020 and MSRA datasets, and compared with the Entity Enhanced BERT Pre-training, BERT+BiLSTM, ERNIE and BiLSTM+CRF models. Experimental results show that, compared with these four models, the proposed model improves the F1 score by 1.63 and 1.1 percentage points, 3.93 and 5.35 percentage points, 2.42 and 4.63 percentage points, and 6.79 and 7.55 percentage points respectively on the two datasets. It can be seen that the proposed model effectively improves the overall performance of named entity recognition, with F1 scores better than those of all comparison models.

Key words: Named Entity Recognition (NER), knowledge base, entity dictionary, pre-training model, Bidirectional Long Short-Term Memory (BiLSTM) network
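The abstract describes a concrete pipeline: build an entity dictionary from CN-DBPedia with Jieba, embed the dictionary entities into BERT, then feed the BERT vectors through a BiLSTM and a CRF. As a rough illustration of the dictionary-building step, the following Python sketch segments downloaded CN-DBPedia documents in Jieba's part-of-speech mode and keeps noun-like tokens as entity candidates. It is a hypothetical reconstruction, not the authors' code: the file names and the frequency threshold are invented for the example.

import jieba.posseg as pseg
from collections import Counter

counts = Counter()
# "cndbpedia_docs.txt" is an invented name for the downloaded CN-DBPedia documents
with open("cndbpedia_docs.txt", encoding="utf-8") as f:
    for line in f:
        for word, flag in pseg.cut(line.strip()):
            # keep noun-like tokens (POS tag starting with "n") of length >= 2
            if flag.startswith("n") and len(word) >= 2:
                counts[word] += 1

# illustrative frequency threshold; the paper does not specify one
entities = sorted(w for w, c in counts.items() if c >= 3)
with open("entity_dict.txt", "w", encoding="utf-8") as out:
    out.writelines(w + "\n" for w in entities)

The tagging model can likewise be sketched as BERT embeddings fed through a BiLSTM into a CRF layer. The sketch below is a minimal stand-in for the proposed architecture, assuming the stock bert-base-chinese checkpoint in place of the entity-enhanced BERT trained in the paper, and the third-party pytorch-crf package for the CRF layer.

import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf

class BertBiLSTMCRF(nn.Module):
    def __init__(self, num_tags, lstm_hidden=256, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        # contextual character vectors from BERT
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        # BiLSTM extracts sequential features; a linear layer maps them to tag scores
        feats, _ = self.bilstm(hidden)
        emissions = self.emission(feats)
        mask = attention_mask.bool()
        if tags is not None:
            # training: negative log-likelihood under the CRF
            return -self.crf(emissions, tags, mask=mask)
        # inference: Viterbi decoding returns the best tag path per sentence,
        # i.e. the "corrected by CRF" step in the abstract
        return self.crf.decode(emissions, mask=mask)

In use, model(input_ids, attention_mask, tags) would give a loss to minimize during training, and model(input_ids, attention_mask) would return the decoded tag sequences at inference time.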
