Journal of Computer Applications (official website) ›› 2022, Vol. 42 ›› Issue (6): 1862-1868. DOI: 10.11772/j.issn.1001-9081.2021040582
Special topic: Artificial intelligence
Received:
2021-04-15
Revised:
2021-07-09
Accepted:
2021-07-15
Online:
2022-06-22
Published:
2022-06-10
Contact:
Xiaoyan HAO
About author:
HAN Yumin, born in 1995 in Linfen, Shanxi, M. S. His research interests include natural language processing.
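Abstract:
Accurate named entity recognition helps build domain knowledge graphs, question answering systems and similar applications. Deep-learning-based Named Entity Recognition (NER) has been widely applied in many specialized domains, but NER research for the materials domain remains comparatively scarce. To address the small size of the datasets available for supervised learning and the high complexity of entity words in materials-domain NER, a subword embedding tokenization model based on the Unigram Language Model (ULM) was trained on large-scale unstructured materials literature, making full use of the information carried by word structure to improve model robustness. An entity recognition model was then proposed that builds on the BiLSTM-CRF model (Bi-directional Long Short-Term Memory network combined with a Conditional Random Field) and adds a Relative Multi-Head Attention (RMHA) mechanism that is aware of direction and distance, in order to increase sensitivity to key words. Compared with models such as BiLSTM-CNNs-CRF and SciBERT, the resulting BiLSTM-RMHA-CRF model combined with ULM subword embedding improved the macro-averaged F1 score (Macro F1) by 2 to 4 percentage points on the Solid Oxide Fuel Cell (SOFC) NER dataset and by 3 to 8 percentage points on the SOFC fine-grained entity recognition dataset. The experimental results show that a recognition model based on subword embedding and relative attention can effectively improve the recognition accuracy of materials-domain entities.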
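To make the unigram language model (ULM) idea in the abstract concrete, the following is a minimal sketch of how a unigram model picks a subword segmentation: each candidate split is scored by the product of its piece probabilities, and the best split is found by dynamic programming. The vocabulary `VOCAB`, its probabilities, and the function name `viterbi_segment` are hypothetical and chosen only for illustration; they are not the paper's trained model, and the sketch returns only the single best split rather than the multiple (noisy) segmentations the paper uses.

```python
import math

# Hypothetical toy vocabulary: subword -> probability (illustrative values only,
# not taken from the paper and not normalized).
VOCAB = {
    "electro": 0.05, "lyte": 0.04, "electrolyte": 0.001,
    "e": 0.02, "l": 0.02, "c": 0.02, "t": 0.02, "r": 0.02, "o": 0.02, "y": 0.02,
}

def viterbi_segment(word, vocab):
    """Return (pieces, log_prob) of the most probable split under a unigram model."""
    n = len(word)
    # best[i] = (best log-probability of word[:i], start index of its last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError(f"cannot segment {word!r} with this vocabulary")
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces, i = [], n
    while i > 0:
        start = best[i][1]
        pieces.append(word[start:i])
        i = start
    return pieces[::-1], best[n][0]

pieces, log_prob = viterbi_segment("electrolyte", VOCAB)
print(pieces, round(log_prob, 3))
```

With these toy probabilities the split into `electro` and `lyte` scores higher than keeping the whole word, which illustrates how word-internal structure can be exposed to the downstream tagger.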
Yumin HAN, Xiaoyan HAO. Material entity recognition based on subword embedding and relative attention[J]. Journal of Computer Applications, 2022, 42(6): 1862-1868.
Category | Labels | Count | Examples |
---|---|---|---|
Non-entity | O | - | resulting, significantly |
Material | B-M I-M | 1 492 | GDC, YSZ, Ba0.5Sr0.5, hydrogen |
Value | B-V I-V | 1 475 | 270 mW, 1 mm below 600℃ |
Device | B-D I-D | 533 | fuel cell, SOFC, micro-solid oxide fuel cell |
Experiment | B-E I-E | 1 090 | fabricated, characterized, compared, demonstrated |
Tab. 1 Label distribution of SOFC NER dataset
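The B-/I- label pairs in Tab. 1 follow the usual BIO convention, in which `B-X` opens an entity of type `X` and `I-X` continues it. As a rough illustration of how such tags map back to entity spans, here is a small sketch; the example sentence, its tokenization, and the helper name `bio_to_spans` are assumptions made for this example and do not come from the dataset.

```python
def bio_to_spans(tags):
    """Convert a list of BIO tags to (label, start, end) spans, end exclusive."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes any entity that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.append((label, start, i))
        start, label = None, None
        if tag.startswith(("B-", "I-")):
            # A stray I- tag is treated leniently as the start of a new entity.
            start, label = i, tag[2:]
    return spans

# Hypothetical tokens using entity examples from Tab. 1.
tokens = ["The", "micro-solid", "oxide", "fuel", "cell", "used", "YSZ"]
tags = ["O", "B-D", "I-D", "I-D", "I-D", "O", "B-M"]
for label, start, end in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Entity-level F1 scores such as those reported for this dataset are normally computed over spans like these rather than over individual tags.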
Category | Fine-grained categories |
---|---|
Material (MATERIAL) | anode_material, cathode_material, electrolyte_material, fuel_used, interconnect_material, interlayer_material, support_material |
Value (VALUE) | conductivity, current_density, degradation_rate, open_circuit_voltage, power_density, resistance, thickness, time_of_operation, voltage, working_temperature |
Device (DEVICE) | device |
Experiment (EXPERIMENT) | experiment_evoking_word |
Tab. 2 Label categories of SOFC fine-grained entity recognition dataset
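Model parameter | Value |
---|---|
Number of ULM segmentation results (including noise) | 3 |
LSTM hidden size | 600 |
RMHA hidden size | 600 |
Number of RMHA heads | 12 |
RMHA normalization parameter | 1 |
Global learning rate | 0.001 |
CRF layer learning rate | 0.1 |
Tab. 3 Model parameter setting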
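With an RMHA hidden size of 600 and 12 heads, each head works on 600 / 12 = 50 dimensions. The sketch below shows one common way a relative position can be made both direction-aware and distance-aware: a sinusoidal encoding of the signed offset between two tokens. This page does not give the paper's exact formulation, so the encoding form, the base 10000, and the function name `relative_encoding` are assumptions for illustration only.

```python
import math

HIDDEN = 600                 # RMHA hidden size from Tab. 3
HEADS = 12                   # number of RMHA heads from Tab. 3
HEAD_DIM = HIDDEN // HEADS   # dimensions handled by each head

def relative_encoding(offset, dim=HEAD_DIM):
    """Sinusoidal encoding of a signed token offset (assumed form): sines, then cosines."""
    half = dim // 2
    freqs = [1.0 / (10000 ** (2 * k / dim)) for k in range(half)]
    sines = [math.sin(offset * f) for f in freqs]
    cosines = [math.cos(offset * f) for f in freqs]
    return sines + cosines

forward = relative_encoding(3)    # key is 3 tokens to one side of the query
backward = relative_encoding(-3)  # key is 3 tokens to the other side
print(HEAD_DIM, forward[0], backward[0])
```

Because sine is odd and cosine is even, offsets +3 and -3 share their cosine half but have opposite sine halves, so the two directions stay distinguishable, while different distances produce different vectors altogether.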
Model | SOFC F1: DEVICE | SOFC F1: EXPERIMENT | SOFC F1: MATERIAL | SOFC F1: VALUE | SOFC Micro F1 | SOFC Macro F1 | SOFC Fine-grained Micro F1 | SOFC Fine-grained Macro F1 |
---|---|---|---|---|---|---|---|---|
Proposed model | 85.03 | 79.04 | 81.88 | 94.40 | 87.63 | 85.09 | 79.09 | 74.42 |
BiLSTM-CNNs-CRF | 80.45 | 75.89 | 79.07 | 94.88 | 86.05 | 82.57 | 75.97 | 67.97 |
LM-LSTM-CRF | 81.62 | 76.19 | 80.40 | 94.17 | 86.45 | 83.09 | 76.71 | 70.74 |
BiGRU-SelfAttn | 75.24 | 73.57 | 81.11 | 92.28 | 84.83 | 80.55 | 73.01 | 65.63 |
SciBERT | 72.70 | 84.50 | 77.00 | 91.60 | 82.97 | 81.50 | 75.74 | 68.61 |
Char-Level CNN-LSTM | 81.57 | 76.21 | 77.55 | 94.32 | 85.62 | 82.41 | 76.98 | 71.16 |
Tab. 4 Experimental results of different models on SOFC NER and fine-grained entity recognition datasets (%)
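Macro F1 is the unweighted mean of the per-class F1 scores, so the SOFC Macro F1 column of Tab. 4 can be checked against the four per-class values (DEVICE, EXPERIMENT, MATERIAL, VALUE, read here as the first four numeric columns). The sketch below does this for three rows copied from the table; the dictionary and function names are only illustrative.

```python
# Per-class F1 on SOFC (DEVICE, EXPERIMENT, MATERIAL, VALUE), copied from Tab. 4.
PER_CLASS_F1 = {
    "Proposed model": [85.03, 79.04, 81.88, 94.40],
    "BiLSTM-CNNs-CRF": [80.45, 75.89, 79.07, 94.88],
    "BiGRU-SelfAttn": [75.24, 73.57, 81.11, 92.28],
}
# SOFC Macro F1 as reported in Tab. 4.
REPORTED_MACRO_F1 = {
    "Proposed model": 85.09,
    "BiLSTM-CNNs-CRF": 82.57,
    "BiGRU-SelfAttn": 80.55,
}

def macro_f1(per_class_scores):
    """Macro F1: every class counts equally, regardless of how many entities it has."""
    return sum(per_class_scores) / len(per_class_scores)

for model, scores in PER_CLASS_F1.items():
    print(f"{model}: computed {macro_f1(scores):.2f}, reported {REPORTED_MACRO_F1[model]:.2f}")
```

The computed means agree with the reported values to within rounding for these rows. The SOFC row for SciBERT gives 81.45 against a reported 81.50, possibly because its per-class scores are rounded more coarsely. Since DEVICE has far fewer instances than the other classes (Tab. 1), Macro F1 rewards models that do well on that rarer class, whereas Micro F1 is dominated by the frequent ones.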
Model | SOFC Micro F1 | SOFC Macro F1 | SOFC Fine-grained Micro F1 | SOFC Fine-grained Macro F1 |
---|---|---|---|---|
BiLSTM-CRF | 72.83 | 70.47 | 63.12 | 51.16 |
+RMHA | 75.43 | 73.40 | 66.26 | 52.70 |
+ULM | 87.72 | 84.22 | 76.93 | 70.12 |
+RMHA+ULM | 87.63 | 85.09 | 79.09 | 74.42 |
Tab. 5 Ablation experimental results (%)
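The contribution of each component in Tab. 5 is easiest to read as a gain over the BiLSTM-CRF baseline. The short sketch below computes those gains in percentage points for the two Macro F1 columns; the values are copied from the table, and the variable and function names are illustrative.

```python
# (SOFC Macro F1, SOFC Fine-grained Macro F1), copied from Tab. 5.
MACRO_F1 = {
    "BiLSTM-CRF": (70.47, 51.16),
    "+RMHA": (73.40, 52.70),
    "+ULM": (84.22, 70.12),
    "+RMHA+ULM": (85.09, 74.42),
}

def gains_over_baseline(results, baseline="BiLSTM-CRF"):
    """Return {variant: (gain on SOFC, gain on fine-grained)} in percentage points."""
    base = results[baseline]
    return {
        name: tuple(round(score - ref, 2) for score, ref in zip(scores, base))
        for name, scores in results.items()
        if name != baseline
    }

for name, (coarse, fine) in gains_over_baseline(MACRO_F1).items():
    print(f"{name}: +{coarse:.2f} (SOFC), +{fine:.2f} (fine-grained)")
```

Read this way, ULM subword embedding accounts for most of the improvement on both datasets, and adding RMHA on top of ULM helps more on the fine-grained task (74.42 against 70.12) than on the coarse one (85.09 against 84.22).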
Model | SOFC Micro F1 | SOFC Macro F1 | SOFC Fine-grained Micro F1 | SOFC Fine-grained Macro F1 |
---|---|---|---|---|
BiLSTM-RMHA-CRF | 75.43 | 73.40 | 66.26 | 52.70 |
+Char-level CNN | 87.74 | 83.38 | 78.04 | 71.33 |
+BPEmb | 88.36 | 84.34 | 75.40 | 67.95 |
+ULM | 87.63 | 85.09 | 79.09 | 74.42 |
+BPEmb+ULM | 87.69 | 84.58 | 80.38 | 76.02 |
Tab. 6 Word embedding experimental results (%)
Model | SOFC Micro F1 | SOFC Macro F1 | SOFC Fine-grained Micro F1 | SOFC Fine-grained Macro F1 |
---|---|---|---|---|
BiLSTM-CRF | 87.72 | 84.22 | 76.93 | 70.12 |
+CNN | 87.07 | 82.93 | 76.60 | 69.73 |
+SA | 87.84 | 84.44 | 74.96 | 69.28 |
+MHA | 86.63 | 83.27 | 76.33 | 70.99 |
+RMHA | 87.63 | 85.09 | 79.09 | 74.42 |
Tab. 7 Feature encoder experimental results (%)