Journal of Computer Applications, 2022, Vol. 42, Issue (6): 1862-1868. DOI: 10.11772/j.issn.1001-9081.2021040582

• Artificial Intelligence •

Material entity recognition based on subword embedding and relative attention

Yumin HAN, Xiaoyan HAO

  1. College of Information and Computer, Taiyuan University of Technology, Taiyuan, Shanxi 030600, China
  • Received: 2021-04-15 Revised: 2021-07-09 Accepted: 2021-07-15 Online: 2022-06-22 Published: 2022-06-10
  • Contact: Xiaoyan HAO
  • About author: HAN Yumin, born in 1995 in Linfen, Shanxi, M. S. His research interests include natural language processing.
  • Supported by:
    Soft Science Research Program of Shanxi Province (2019041055-1); Scientific Research and Technology Project of Peking University (203290929-J)

Abstract:

Accurate recognition of named entities supports the construction of domain knowledge graphs, question answering systems, and similar applications. Named Entity Recognition (NER) based on deep learning has been widely applied in many specialized fields, but NER research for the materials domain remains relatively scarce. Materials-domain NER faces two problems: the datasets available for supervised learning are small, and entity words are highly complex. To address them, large-scale unstructured materials literature was used to train a subword segmentation and embedding model based on the Unigram Language Model (ULM), making full use of the information carried by word structure to improve model robustness. In addition, an entity recognition model was proposed that builds on BiLSTM-CRF (a Bi-directional Long Short-Term Memory network combined with a Conditional Random Field) and adds Relative Multi-Head Attention (RMHA), which is aware of the direction and distance between words, to increase the model's sensitivity to keywords. Combined with the ULM subword embedding method, the resulting BiLSTM-RMHA-CRF model improved the macro-averaged F1 score (Macro F1) by 2 to 4 percentage points on the Solid Oxide Fuel Cell (SOFC) NER dataset and by 3 to 8 percentage points on the SOFC fine-grained entity recognition dataset, compared with models such as BiLSTM-CNNs-CRF and SciBERT (Scientific BERT). The experimental results show that the recognition model based on subword embedding and relative attention effectively improves entity recognition accuracy in the materials domain.
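
To make the ULM subword idea concrete, the following is a minimal sketch of how a unigram language model can pick the most probable split of a word with a Viterbi search. It is an illustration only, not the authors' implementation: the function name `viterbi_segment`, the example word, and the toy vocabulary with its probabilities are all hypothetical, whereas a real model would learn its vocabulary and probabilities from the materials literature corpus.

```python
import math


def viterbi_segment(word, log_probs):
    """Return the most probable subword split of `word` under a unigram LM.

    `log_probs` maps each subword in the vocabulary to its log-probability.
    Returns None if the vocabulary cannot cover the word.
    """
    n = len(word)
    # best[i] = (best log-probability of word[:i], start index of the last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        start = best[i][1]
        pieces.append(word[start:i])
        i = start
    return pieces[::-1]


# Hypothetical toy vocabulary; the probabilities are made up for illustration.
toy_probs = {"perov": 0.02, "skite": 0.02, "per": 0.05,
             "ov": 0.03, "s": 0.1, "kite": 0.01}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

segmentation = viterbi_segment("perovskite", toy_log_probs)
print(segmentation)
```

Because the score of a split is the sum of its pieces' log-probabilities, a rare domain word can still be represented through frequent subwords, which is the property the abstract relies on for robustness to complex entity words.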

Key words: named entity recognition, subword embedding, relative attention, deep learning, materials field
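
The Macro F1 metric reported in the abstract averages the per-class F1 scores with equal weight, so rare entity types count as much as frequent ones. The sketch below computes it over a flat list of labels. It is a simplified illustration under stated assumptions: the function name `macro_f1` and the example labels are hypothetical, and NER evaluations are often scored at the entity-span level rather than per item as done here.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels for four items; not taken from the SOFC datasets.
gold_labels = ["MATERIAL", "MATERIAL", "VALUE", "DEVICE"]
pred_labels = ["MATERIAL", "VALUE", "VALUE", "DEVICE"]

score = macro_f1(gold_labels, pred_labels)
print(round(score, 4))
```

In this toy case the per-class F1 values are 2/3, 2/3, and 1, so the macro average is 7/9. An improvement of "2 to 4 percentage points" in the abstract refers to differences in this averaged value expressed as a percentage.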

CLC number: