Material entity recognition based on subword embedding and relative attention

doi:10.11772/j.issn.1001-9081.2021040582

Abstract

Accurate named entity recognition supports the construction of domain-specific knowledge graphs and question answering systems. Named Entity Recognition (NER) based on deep learning has been widely applied in many professional fields, but research on NER in the materials field remains relatively scarce. To address the small scale of the datasets available for supervised learning and the high complexity of entity words in materials NER, large-scale unstructured literature data from the materials field were used to train a subword embedding segmentation model based on the Unigram Language Model (ULM), so that the information contained in word structure is fully exploited to make the model more robust. In addition, an entity recognition model was proposed that takes BiLSTM-CRF (Bi-directional Long Short-Term Memory-Conditional Random Field) as its basis and adds Relative Multi-Head Attention (RMHA), which is aware of the direction and distance between words, to improve the model's sensitivity to keywords. Compared with BiLSTM-CNNs-CRF, SciBERT (Scientific BERT) and other models, the resulting BiLSTM-RMHA-CRF model combined with ULM subword embedding improved Macro F1 by 2 to 4 percentage points on the Solid Oxide Fuel Cell (SOFC) NER dataset and by 3 to 8 percentage points on the SOFC fine-grained entity recognition dataset. The experimental results show that a recognition model based on subword embedding and relative attention can effectively improve the recognition accuracy of entities in the materials field.

Key words: named entity recognition, subword embedding, relative attention, deep learning, materials field
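To make the ULM segmentation step in the abstract more concrete, the following is a minimal sketch of how a unigram language model picks the most probable split of a word into subword pieces with Viterbi search. The vocabulary `TOY_LOGP`, its probabilities, the fallback score `unk_logp`, and the function name `segment` are all hypothetical and chosen for illustration; a real ULM vocabulary is learned from a large corpus, and this sketch returns only the single best split rather than several candidate segmentations.

```python
import math

# Hypothetical toy vocabulary: subword piece -> log-probability.
# These values are made up for illustration, not learned from any corpus.
TOY_LOGP = {
    "electro": math.log(0.05),
    "lyte": math.log(0.04),
    "electrolyte": math.log(0.001),
}


def segment(word, logp, unk_logp=math.log(1e-4)):
    """Return the highest-probability split of `word` under a unigram model.

    Single characters missing from `logp` fall back to `unk_logp`
    (an assumption of this sketch) so that every word can be segmented.
    """
    n = len(word)
    best = [float("-inf")] * (n + 1)  # best[i]: best score of word[:i]
    back = [0] * (n + 1)              # back[i]: start of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp:
                score = logp[piece]
            elif end - start == 1:
                score = unk_logp
            else:
                continue
            candidate = best[start] + score
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


print(segment("electrolyte", TOY_LOGP))  # ['electro', 'lyte']
```

With these toy probabilities the two-piece split scores 0.05 × 0.04 = 0.002, which beats the whole-word probability of 0.001, so the word is split into pieces that can each carry their own embedding.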

CLC Number:

TP391

Yumin HAN, Xiaoyan HAO. Material entity recognition based on subword embedding and relative attention[J]. Journal of Computer Applications, 2022, 42(6): 1862-1868.

Category	Labels	Count	Examples
Non-entity	O	-	resulting, significantly
Material	B-M I-M	1,492	GDC, YSZ, Ba_0.5Sr_0.5, hydrogen
Value	B-V I-V	1,475	270 mW, 1 mm below 600℃
Device	B-D I-D	533	fuel cell, SOFC, micro-solid oxide fuel cell
Experiment	B-E I-E	1,090	fabricated, characterized, compared, demonstrated
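
The B-/I-/O labels in the table above follow the usual BIO tagging scheme, in which B- opens an entity, I- continues it, and O marks a non-entity token. As a minimal sketch of how a tagged sentence could be turned back into entity spans, consider the following. The function name `bio_to_spans` and the example sentence are hypothetical, and dropping a stray I- tag that does not continue an open entity is an assumption of this sketch, not something stated in the source.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans.

    An I- tag that does not continue an open entity of the same type
    is treated as outside any entity (an assumption of this sketch).
    """
    spans = []
    current_type = None
    current_tokens = []

    def close():
        # Emit the entity being built, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            close()
            current_type, current_tokens = None, []
    close()
    return spans


# Hypothetical sentence built from example entities in the table.
tokens = ["The", "micro-solid", "oxide", "fuel", "cell", "used", "GDC"]
tags = ["O", "B-D", "I-D", "I-D", "I-D", "O", "B-M"]
print(bio_to_spans(tokens, tags))
# [('D', 'micro-solid oxide fuel cell'), ('M', 'GDC')]
```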

Category	Fine-grained categories
Material (MATERIAL)	anode_material, cathode_material, electrolyte_material, fuel_used, interconnect_material, interlayer_material, support_material
Value (VALUE)	conductivity, current_density, degradation_rate, open_circuit_voltage, power_density, resistance, thickness, time_of_operation, voltage, working_temperature
Device (DEVICE)	device
Experiment (EXPERIMENT)	experiment_evoking_word
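
The table above defines a two-level label hierarchy, which can be written down as a small lookup structure. The sketch below copies the category names from the table; the variable names `COARSE_TO_FINE` and `FINE_TO_COARSE` and the helper `coarse_label` are hypothetical and only illustrate how a fine-grained prediction could be mapped back to its coarse category.

```python
# Coarse category -> fine-grained categories, copied from the table above.
COARSE_TO_FINE = {
    "MATERIAL": [
        "anode_material", "cathode_material", "electrolyte_material",
        "fuel_used", "interconnect_material", "interlayer_material",
        "support_material",
    ],
    "VALUE": [
        "conductivity", "current_density", "degradation_rate",
        "open_circuit_voltage", "power_density", "resistance",
        "thickness", "time_of_operation", "voltage", "working_temperature",
    ],
    "DEVICE": ["device"],
    "EXPERIMENT": ["experiment_evoking_word"],
}

# Inverted index: fine-grained category -> coarse category.
FINE_TO_COARSE = {
    fine: coarse
    for coarse, fines in COARSE_TO_FINE.items()
    for fine in fines
}


def coarse_label(fine_label):
    """Map a fine-grained category to its coarse category."""
    return FINE_TO_COARSE[fine_label]


print(len(FINE_TO_COARSE))            # 19
print(coarse_label("power_density"))  # VALUE
```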

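Model parameter	Value
ULM segmentation results (including noise)	3
LSTM hidden size	600
RMHA hidden size	600
Number of RMHA heads	12
RMHA normalization parameter	1
Global learning rate	0.001
CRF layer learning rate	0.1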
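
As a minimal sketch, the settings in the table above could be collected in a single configuration object. The class name `ModelConfig`, its field names, and the `"crf."` parameter-name prefix are hypothetical. The per-head size of 50 assumes the RMHA hidden size is split evenly across heads, as in standard multi-head attention; the source does not state this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameter values taken from the table; field names are hypothetical."""
    ulm_segmentations: int = 3      # ULM segmentation results, including noise
    lstm_hidden_size: int = 600
    rmha_hidden_size: int = 600
    rmha_num_heads: int = 12
    rmha_norm_param: float = 1.0
    global_lr: float = 0.001
    crf_lr: float = 0.1

    @property
    def rmha_head_dim(self):
        # Assumption: the hidden size is divided evenly across the heads.
        if self.rmha_hidden_size % self.rmha_num_heads != 0:
            raise ValueError("hidden size must be divisible by the head count")
        return self.rmha_hidden_size // self.rmha_num_heads

    def learning_rate_for(self, param_name):
        # Assumption: CRF parameters are identified by a "crf." name prefix.
        return self.crf_lr if param_name.startswith("crf.") else self.global_lr


config = ModelConfig()
print(config.rmha_head_dim)                         # 50
print(config.learning_rate_for("crf.transitions"))  # 0.1
print(config.learning_rate_for("lstm.weight"))      # 0.001
```

Keeping a separate learning rate for the CRF layer, as the table does, lets the transition scores be updated at a different pace from the rest of the network.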