基于transformer的python命名实体识别模型

doi:10.11772/j.issn.1001-9081.2021071356

摘要/Abstract

摘要：

最近一些基于字符的命名实体识别（NER）模型无法充分利用词信息，而利用词信息的格子结构模型可能会退化为基于词的模型而出现分词错误。针对这些问题提出了一种基于transformer的python NER模型来编码字符-词信息。首先，将词信息与词开始或结束对应的字符绑定；然后，利用三种不同的策略，将词信息通过transformer编码为固定大小的表示；最后，使用条件随机场（CRF）解码，从而避免获取词边界信息带来的分词错误，并提升批量训练速度。在python数据集上的实验结果可以看出，所提模型的F1值比Lattice-LSTM模型高2.64个百分点，同时训练时间是对比模型的1/4左右，说明所提模型能够防止模型退化，提升批量训练速度，更好地识别python命名实体。

关键词: 命名实体识别, 词边界, python, 词信息, transformer

Abstract:

Recently， some character-based Named Entity Recognition （NER） models cannot make full use of word information， and the lattice structure model using word information may degenerate into a word-based model and cause word segmentation errors. To deal with these problems， a python NER model based on transformer was proposed to encode character-word information. Firstly， the word information was bound to the characters corresponding to the beginning or end of the word. Then， three different strategies were used to encode the word information into a fixed-size representation through the transformer. Finally， Conditional Random Field （CRF） was used for decoding， thereby avoiding the problem of word segmentation errors caused by obtaining the word boundary information as well as improving the batch training speed. Experimental results on the python dataset show that the F1 score of the proposed model is 2.64 percentage points higher than that of the Lattice-LSTM model， and the training time of the proposed model is about a quarter of the comparison model， indicating that the proposed model can prevent model degradation， improve batch training speed， and better recognize the python named entities.

Key words: Named Entity Recognition (NER), word boundary, python, word information, transformer

中图分类号:

TP391.1

徐关友, 冯伟森. 基于transformer的python命名实体识别模型[J]. 计算机应用, 2022, 42(9): 2693-2700.

Guanyou XU, Weisen FENG. Python named entity recognition model based on transformer[J]. Journal of Computer Applications, 2022, 42(9): 2693-2700.

图/表 11

参考文献 32

1	DIEFENBACH D， LOPEZ V， SINGH K， et al. Core techniques of question answering systems over knowledge bases： a survey［J］. Knowledge and Information Systems， 2018， 55（3）： 529-569. 10.1007/s10115-017-1100-y
2	VEALE T. Creative language retrieval： a robust hybrid of information retrieval and linguistic creativity［C］// Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics： Human Language Technologies. Stroudsburg， PA： Association for Computational Linguistics， 2011： 278-287.
3	HAN X， GAO T Y， LIN Y K， et al. More data， more relations， more context and more openness： a review and outlook for relation extraction［C］// Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. Stroudsburg， PA： Association for Computational Linguistics， 2020： 745-758.
4	SAITO K， NAGATA M. Multi-language named-entity recognition system based on HMM［C］// Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-Language Named Entity Recognition. Stroudsburg， PA： Association for Computational Linguistics， 2003： 41-48. 10.3115/1119384.1119390
5	FENG Y Y， SUN L， LV Y H. Chinese word segmentation and named entity recognition based on conditional random fields models［C］// Proceedings of the 5th SIGHAN Workshop on Chinese Language Processing. Stroudsburg， PA： Association for Computational Linguistics， 2006： 181-184.
6	EKBAL A， BANDYOPADHYAY S. Named entity recognition using support vector machine： a language independent approach［J］. International Journal of Electrical， Computer， and Systems Engineering， 2010， 4（2）： 155-170.
7	LI X N， YAN H， QIU X P， et al. FLAT： Chinese NER using flat-lattice transformer［C］// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg， PA： Association for Computational Linguistics， 2020： 6836-6842. 10.18653/v1/2020.acl-main.611
8	HE H F， SUN X. A unified model for cross-domain and semi-supervised named entity recognition in Chinese social media［C］// Proceedings of the 31st AAAI Conference on Artificial Intelligence. Palo Alto， CA： AAAI Press， 2017： 3216-3222. 10.1609/aaai.v31i1.10977
9	CAO P F， CHEN Y B， LIU K， et al. Adversarial transfer learning for Chinese named entity recognition with self-attention mechanism［C］// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg， PA： Association for Computational Linguistics， 2018： 182-192. 10.18653/v1/d18-1017
10	LI H B， HAGIWARA M， LI Q， et al. Comparison of the impact of word segmentation on name tagging for Chinese and Japanese［C］// Proceedings of the 9th International Conference on Language Resources and Evaluation. Paris： European Language Resources Association， 2014： 2532-2536.
11	ZHANG Y， YANG J. Chinese NER using lattice LSTM［C］// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics （Volume 1： Long Papers）. Stroudsburg， PA： Association for Computational Linguistics， 2018： 1554-1564. 10.18653/v1/p18-1144
12	MA R T， PENG M L， ZHANG Q， et al. Simplify the usage of lexicon in Chinese NER［C］// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg， PA： Association for Computational Linguistics， 2020： 5951-5960. 10.18653/v1/2020.acl-main.528
13	GUI T， MA R T， ZHANG Q， et al. CNN-based Chinese NER with lexicon rethinking［C］// Proceedings of the 28th International Joint Conference on Artificial Intelligence. California： ijcai.org， 2019： 4982-4988. 10.24963/ijcai.2019/692
14	GUI T， ZOU Y C， ZHANG Q， et al. A lexicon-based graph neural network for Chinese NER［C］// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg， PA： Association for Computational Linguistics， 2019： 1040-1050. 10.18653/v1/d19-1096
15	LI J， SUN A X， HAN J L， et al. A survey on deep learning for named entity recognition［J］. IEEE Transactions on Knowledge and Data Engineering， 2022， 34（1）： 50-70. 10.1109/tkde.2020.2981314
16	XU C W， WANG F Y， HAN J L， et al. Exploiting multiple embeddings for Chinese named entity recognition［C］// Proceedings of the 28th ACM International Conference on Information and Knowledge Management. New York： ACM， 2019： 2269-2272. 10.1145/3357384.3358117
17	SUN Y， WANG S H， LI Y K， et al. ERNIE 2.0： a continual pre-training framework for language understanding［C］// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto， CA： AAAI Press， 2020： 8968-8975. 10.1609/aaai.v34i05.6428
18	LIU W， XU T G， XU Q H， et al. An encoding strategy based word-character LSTM for Chinese NER［C］// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics： Human Language Technologies， Volume 1 （Long and Short Papers）. Stroudsburg， PA： Association for Computational Linguistics， 2019： 2379-2389. 10.18653/v1/n18-2
19	MENG Y X， WU W， WANG F， et al. Glyce： glyph-vectors for Chinese character representations［C/OL］// Proceedings of the 33rd Conference on Neural Information Processing Systems. ［2021-03-15］..
20	XUAN Z Y， BAO R， JIANG S Y. FGN： fusion glyph network for Chinese named entity recognition［C］// Proceedings of the 2020 China Conference on Knowledge Graph and Semantic Computing， CCIS 1356. Singapore： Springer， 2021： 28-40.
21	YAN H， DENG B C， LI X N， et al. TENER： adapting transformer encoder for named entity recognition［EB/OL］. （2019-12-10）［2020-10-13］..
22	ZHU Y Y， WANG G X. CAN-NER： convolutional attention network for Chinese named entity recognition［C］// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics： Human Language Technologies， Volume 1 （Long and Short Papers）. Stroudsburg， PA： Association for Computational Linguistics， 2019： 3384-3393. 10.18653/v1/N19-1342
23	DING R X， XIE P J， ZHANG X Y， et al. A neural multi-digraph model for Chinese NER with gazetteers［C］// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg， PA： Association for Computational Linguistics， 2019： 1462-1467. 10.18653/v1/p19-1141
24	SUI D B， CHEN Y B， LIU K， et al. Leverage lexical knowledge for Chinese named entity recognition via collaborative graph network［C］// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg， PA： Association for Computational Linguistics， 2019： 3830-3840. 10.18653/v1/d19-1396
25	WU F Z， LIU J X， WU C H， et al. Neural Chinese named entity recognition via CNN-LSTM-CRF and joint training with word segmentation［C］// Proceedings of the 2019 World Wide Web Conference. New York： ACM， 2019： 3342-3348. 10.1145/3308558.3313743
26	XUE M G， YU B W， LIU T W， et al. Porous lattice transformer encoder for Chinese NER［C］// Proceedings of the 28th International Conference on Computational Linguistics. ［S.l.］： International Committee on Computational Linguistics， 2020： 3831-3841. 10.18653/v1/2020.coling-main.340
27	ZHAO H S， YANG Y， ZHANG Q， et al. Improve neural entity recognition via multi-task data selection and constrained decoding［C］// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics： Human Language Technologies， Volume 2 （Short Papers）. Stroudsburg， PA： Association for Computational Linguistics， 2018： 346-351. 10.18653/v1/n18-2056
28	VASWANI A， SHAZEER N， PARMAR N， et al. Attention is all you need［C］// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook， NY： Curran Associates Inc.， 2017： 6000-6010.
29	HETLAND M L. Python基础教程［M］. 3版. 袁国忠，译. 北京：人民邮电出版社， 2014：1-458. 10.1007/978-1-4842-0055-1_1
	HETLAND M L. Beginning Python： From Novice to Professional［M］. 3rd ed. YUAN G Z， translated. Beijing： People’s Posts and Telecommunications Press， 2008：1-458. 10.1007/978-1-4842-0055-1_1
30	YANG J， ZHANG Y， LI L W， et al. YEDDA： a lightweight collaborative text span annotation tool［C］// Proceedings of ACL 2018， System Demonstrations. Stroudsburg， PA： Association for Computational Linguistics， 2018： 31-36. 10.18653/v1/p18-4006
31	杨玉基，许斌，胡家威，等. 一种准确而高效的领域知识图谱构建方法［J］. 软件学报， 2018， 29（10）： 2931-2947. 10.13328/j.cnki.jos.005552
	YANG Y J， XU B， HU J W， et al. Accurate and efficient method for constructing domain knowledge graph［J］. Journal of Software， 2018， 29（10）： 2931-2947. 10.13328/j.cnki.jos.005552
32	李振，周东岱. 教育知识图谱的概念模型与构建方法研究［J］. 电化教育研究， 2019， 40（8）： 78-86， 113.
	LI Z， ZHOU D D. Research on conceptual model and construction method of educational knowledge graph［J］. e-Education Research， 2019， 40（8）： 78-86， 113.

数据集	类型	训练集	开发集	测试集
python	句子数	6.1K	0.7K	0.6K
python	字符数	207.4K	23.3K	22.3K
resume	句子数	3.8K	0.5K	0.5K
resume	字符数	124.1K	13.9K	15.1K
weibo	句子数	1.4K	0.27K	0.27K
weibo	字符数	73.8K	14.5K	14.8K

数据集	类型	训练集	开发集	测试集
python	句子数	6.1K	0.7K	0.6K
python	字符数	207.4K	23.3K	22.3K
resume	句子数	3.8K	0.5K	0.5K
resume	字符数	124.1K	13.9K	15.1K
weibo	句子数	1.4K	0.27K	0.27K
weibo	字符数	73.8K	14.5K	14.8K

参数	值
hidden_size	［160，256，320，480］
number of layers	［1，2，4］
number of head	［4，6，8，10，16］
head dimension	［32，64，96，128，256］
max_len	［175，178，199］
fc dropout	0.4
transformer dropout	0.15
optimizer	SGD
learning rate	［1E-3，7E-4］
clip	5
batch_size	10
epochs	［75，100］

参数	值
hidden_size	［160，256，320，480］
number of layers	［1，2，4］
number of head	［4，6，8，10，16］
head dimension	［32，64，96，128，256］
max_len	［175，178，199］
fc dropout	0.4
transformer dropout	0.15
optimizer	SGD
learning rate	［1E-3，7E-4］
clip	5
batch_size	10
epochs	［75，100］

环境		配置
硬件	操作系统	Windows10
	中央处理器	AMD Ryzen7 3700X
	图形处理器	GeForce RTX 3070
	内存	32 GB
软件	编程环境	Anaconda
	Python	Python3.6
	Pytorch	1.8.0
	Fastnlp	0.5.0