Material entity recognition based on subword embedding and relative attention

doi:10.11772/j.issn.1001-9081.2021040582

Journal of Computer Applications, 2022, Vol. 42, Issue (6): 1862-1868. DOI: 10.11772/j.issn.1001-9081.2021040582

Topic: Artificial Intelligence

Yumin HAN, Xiaoyan HAO

College of Information and Computer, Taiyuan University of Technology, Taiyuan Shanxi 030600, China

Received: 2021-04-15; Revised: 2021-07-09; Accepted: 2021-07-15; Online: 2022-06-22; Published: 2022-06-10
Contact: Xiaoyan HAO
About author: HAN Yumin, born in 1995 in Linfen, Shanxi, M. S. His research interests include natural language processing.
Supported by:
Soft Science Research Program of Shanxi Province (2019041055-1); Scientific Research and Technology Project of Peking University (203290929-J)

Abstract:

Accurate named entity recognition supports the construction of domain knowledge graphs, question answering systems, and similar applications. Deep-learning-based Named Entity Recognition (NER) has been widely applied in many professional domains, yet relatively little NER research targets the materials domain. To address the small size of the datasets available for supervised learning and the high complexity of entity words in materials NER, large-scale unstructured materials literature was used to train a subword tokenization model based on the Unigram Language Model (ULM) for subword embedding, and the information carried by word structure was exploited to make the model more robust. In addition, an entity recognition model was proposed that builds on BiLSTM-CRF (Bi-directional Long Short-Term Memory with a Conditional Random Field) and adds Relative Multi-Head Attention (RMHA), which is aware of the direction and distance between words, to increase the model's sensitivity to keywords. Compared with BiLSTM-CNNs-CRF, SciBERT (Scientific BERT), and other models, the resulting BiLSTM-RMHA-CRF model combined with the ULM subword embedding method improved Macro F1 by 2-4 percentage points on the Solid Oxide Fuel Cell (SOFC) NER dataset and by 3-8 percentage points on the SOFC fine-grained entity recognition dataset. Experimental results show that the recognition model based on subword embedding and relative attention effectively improves the recognition accuracy of entities in the materials domain.

Key words: named entity recognition, subword embedding, relative attention, deep learning, materials domain
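
The abstract names a ULM-based subword tokenizer but gives no implementation details. As a rough illustration only, the sketch below shows how a unigram language model can pick the most probable segmentation of a word with Viterbi search. The vocabulary, its probabilities, and the function name `viterbi_segment` are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical toy vocabulary with made-up unigram probabilities.
# A real ULM vocabulary would be learned from a large materials corpus.
TOY_PROBS = {
    "electro": 0.04, "lyte": 0.03, "elec": 0.02, "tro": 0.02,
    "e": 0.001, "l": 0.001, "c": 0.001, "t": 0.001,
    "r": 0.001, "o": 0.001, "y": 0.001,
}
LOG_PROBS = {piece: math.log(p) for piece, p in TOY_PROBS.items()}


def viterbi_segment(word, log_probs):
    """Return (pieces, log_prob) for the most probable segmentation of word."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece not in log_probs:
                continue
            score = best[start] + log_probs[piece]
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError(f"cannot segment {word!r} with this vocabulary")
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]


pieces, score = viterbi_segment("electrolyte", LOG_PROBS)
print(pieces)
```

With these toy probabilities the two-piece split wins over the three-piece and character-level splits because it multiplies fewer, larger probabilities. The subword regularization work cited by the paper additionally samples alternative segmentations rather than always taking the single best one; that step is not shown here.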

CLC number: TP391

Yumin HAN, Xiaoyan HAO. Material entity recognition based on subword embedding and relative attention[J]. Journal of Computer Applications, 2022, 42(6): 1862-1868.

Figures/Tables (12)

Category	Labels	Count	Examples
Non-entity	O	-	resulting, significantly
Material	B-M, I-M	1,492	GDC, YSZ, Ba_0.5Sr_0.5, hydrogen
Parameter value	B-V, I-V	1,475	270 mW, 1 mm, below 600℃
Device	B-D, I-D	533	fuel cell, SOFC, micro-solid oxide fuel cell
Experiment	B-E, I-E	1,090	fabricated, characterized, compared, demonstrated
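
The B-/I-/O labels in the table above follow the usual BIO tagging convention. As a small illustration, the sketch below groups a tagged token sequence back into entity spans. The sentence and its tags are invented for the example using entity strings from the table, and `bio_to_spans` is a hypothetical helper name, not code from the paper.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []

    def close_span():
        # Store the open span, if any.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span.
            close_span()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open span.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open span) ends the span;
            # the stray token itself is not kept.
            close_span()
            current_type, current_tokens = None, []
    close_span()
    return spans


# Invented example sentence; tags use the label set from the table.
tokens = ["The", "micro-solid", "oxide", "fuel", "cell", "was", "fabricated", "with", "YSZ"]
tags = ["O", "B-D", "I-D", "I-D", "I-D", "O", "B-E", "O", "B-M"]
print(bio_to_spans(tokens, tags))
```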

Category	Fine-grained categories
Material (MATERIAL)	anode_material, cathode_material, electrolyte_material, fuel_used, interconnect_material, interlayer_material, support_material
Parameter value (VALUE)	conductivity, current_density, degradation_rate, open_circuit_voltage, power_density, resistance, thickness, time_of_operation, voltage, working_temperature
Device (DEVICE)	device
Experiment (EXPERIMENT)	experiment_evoking_word

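Model parameter	Value
Number of ULM segmentation results (including noise)	3
LSTM hidden size	600
RMHA hidden size	600
Number of RMHA heads	12
RMHA normalization parameter	1
Global learning rate	0.001
CRF layer learning rate	0.1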
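
The page does not reproduce the RMHA equations. As a hedged illustration of what direction- and distance-aware attention can look like, the sketch below implements one common single-head formulation in the style of Transformer-XL and TENER, where the score for positions (t, j) is q_t·k_j + q_t·r_{t-j} + u·k_j + v·r_{t-j} and r is a sinusoidal encoding of the signed offset. Whether the paper's RMHA matches this term by term is an assumption, as is the even split of the hidden size across heads. All names here are hypothetical, and the input projections, scaling, and multi-head concatenation are omitted.

```python
import math

# Assumption: the 600-dimensional RMHA hidden size is split evenly over 12 heads.
HEAD_DIM = 600 // 12


def relative_encoding(offset, dim):
    """Sinusoidal encoding of a signed offset t - j (dim must be even)."""
    enc = []
    for i in range(dim // 2):
        angle = offset / (10000 ** (2 * i / dim))
        # sin is odd, so it carries direction; cos is even, so it carries distance.
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relative_attention(queries, keys, values, u_bias, v_bias):
    """Single-head attention with relative position terms (unscaled)."""
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            r = relative_encoding(t - j, dim)
            scores.append(dot(q, k) + dot(q, r) + dot(u_bias, k) + dot(v_bias, r))
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        outputs.append([
            sum(w * val[d] for w, val in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Tiny made-up inputs: 3 positions, 4-dimensional queries/keys, 2-dimensional values.
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = relative_attention(queries, keys, values, [0.1] * 4, [0.1] * 4)
print(out)
```

Because the sine components change sign when the offset is negated while the cosine components do not, a token two positions to the left and one two positions to the right receive different encodings, which is what lets the scores depend on direction as well as distance.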

Model	SOFC F1 (DEVICE)	SOFC F1 (EXPERIMENT)	SOFC F1 (MATERIAL)	SOFC F1 (VALUE)	SOFC Micro F1	SOFC Macro F1	SOFC Fine-grained Micro F1	SOFC Fine-grained Macro F1
Proposed model	85.03	79.04	81.88	94.40	87.63	85.09	79.09	74.42
BiLSTM-CNNs-CRF	80.45	75.89	79.07	94.88	86.05	82.57	75.97	67.97
LM-LSTM-CRF	81.62	76.19	80.40	94.17	86.45	83.09	76.71	70.74
BiGRU-SelfAttn	75.24	73.57	81.11	92.28	84.83	80.55	73.01	65.63
SciBERT	72.70	84.50	77.00	91.60	82.97	81.50	75.74	68.61
Char-Level CNN-LSTM	81.57	76.21	77.55	94.32	85.62	82.41	76.98	71.16
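
Macro F1 is the unweighted mean of the per-class F1 scores, so the proposed model's SOFC Macro F1 can be re-derived from the four class columns above. The sketch below does that and also computes the proposed model's margin over each baseline, for comparison with the 2-4 and 3-8 percentage-point ranges quoted in the abstract. The numbers are copied from the table; the variable and function names are hypothetical.

```python
# Per-class SOFC F1 of the proposed model: DEVICE, EXPERIMENT, MATERIAL, VALUE.
PROPOSED_PER_CLASS_F1 = [85.03, 79.04, 81.88, 94.40]

# Reported Macro F1 as (SOFC, SOFC fine-grained), copied from the table.
PROPOSED_MACRO = (85.09, 74.42)
BASELINE_MACRO = {
    "BiLSTM-CNNs-CRF": (82.57, 67.97),
    "LM-LSTM-CRF": (83.09, 70.74),
    "BiGRU-SelfAttn": (80.55, 65.63),
    "SciBERT": (81.50, 68.61),
    "Char-Level CNN-LSTM": (82.41, 71.16),
}


def macro_f1(per_class_scores):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_scores) / len(per_class_scores)


def margins(proposed, baselines, column):
    """Proposed-model gain over each baseline for one Macro F1 column."""
    return {
        name: round(proposed[column] - scores[column], 2)
        for name, scores in baselines.items()
    }


recomputed = round(macro_f1(PROPOSED_PER_CLASS_F1), 2)
sofc_margins = margins(PROPOSED_MACRO, BASELINE_MACRO, 0)
fine_margins = margins(PROPOSED_MACRO, BASELINE_MACRO, 1)

print(recomputed)  # 85.09, matching the reported SOFC Macro F1
print(min(sofc_margins.values()), max(sofc_margins.values()))
print(min(fine_margins.values()), max(fine_margins.values()))
```

Micro F1 cannot be recomputed the same way, because it pools true positives, false positives, and false negatives across classes, and those counts are not given on this page.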

Model	SOFC Micro F1	SOFC Macro F1	SOFC Fine-grained Micro F1	SOFC Fine-grained Macro F1
BiLSTM-CRF	72.83	70.47	63.12	51.16
+RMHA	75.43	73.40	66.26	52.70
+ULM	87.72	84.22	76.93	70.12
+RMHA+ULM	87.63	85.09	79.09	74.42
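
To read the ablation more easily, the gains of each variant over the plain BiLSTM-CRF baseline can be tabulated directly from the rows above. The sketch below is plain arithmetic on the reported numbers; the names `ABLATION` and `gains_over_baseline` are hypothetical.

```python
# Rows copied from the ablation table, as
# (SOFC Micro F1, SOFC Macro F1, fine-grained Micro F1, fine-grained Macro F1).
ABLATION = {
    "BiLSTM-CRF": (72.83, 70.47, 63.12, 51.16),
    "+RMHA": (75.43, 73.40, 66.26, 52.70),
    "+ULM": (87.72, 84.22, 76.93, 70.12),
    "+RMHA+ULM": (87.63, 85.09, 79.09, 74.42),
}


def gains_over_baseline(table, baseline="BiLSTM-CRF"):
    """Percentage-point gain of every other row over the baseline row."""
    base = table[baseline]
    return {
        name: tuple(round(score - ref, 2) for score, ref in zip(row, base))
        for name, row in table.items()
        if name != baseline
    }


for name, gain in gains_over_baseline(ABLATION).items():
    print(name, gain)
```

On these numbers ULM subword embedding accounts for most of the improvement, while adding RMHA on top of ULM mainly helps the fine-grained Macro F1 and leaves SOFC Micro F1 slightly lower (87.63 versus 87.72).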
