Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (2): 357-362. DOI: 10.11772/j.issn.1001-9081.2020050738

Special topic: Artificial Intelligence

Biomedical named entity recognition with graph network based on syntactic dependency parsing

XU Li, LI Jianhua

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2020-06-01 Revised: 2020-07-29 Published: 2021-02-10 Online: 2020-09-15
  • Corresponding author: LI Jianhua
  • About the authors: XU Li (born 1997), male, from Hefei, Anhui, is a master's student whose main research interest is natural language processing. LI Jianhua (born 1977), male, from Guangde, Anhui, Ph.D., is an associate professor and CCF member whose main research interests are computer-aided design, drug data mining, and bioinformatics.
  • Supported by:
    This work is partially supported by the National Science and Technology Major Project for "Significant New Drugs Development" (2018ZX09735002) and the National Key Research and Development Program of China (2016YFA0502304).

Abstract: Existing biomedical named entity recognition methods do not exploit the syntactic information in the corpus, which limits their accuracy. To address this problem, a graph network model for biomedical named entity recognition based on syntactic dependency parsing is proposed. First, a Convolutional Neural Network (CNN) generates character vectors, which are concatenated with word vectors and fed into a Bidirectional Long Short-Term Memory (BiLSTM) network for training. Second, syntactic dependency parsing is performed on the corpus sentence by sentence, and an adjacency matrix is constructed from each parse. Finally, the BiLSTM output and the adjacency matrix are fed into a Graph Convolutional Network (GCN) for training, and a graph attention mechanism is introduced to optimize the feature weights of adjacent nodes and produce the model output. On the JNLPBA and NCBI-disease datasets, the proposed model reaches F1 scores of 76.91% and 87.80% respectively, which are 2.62 and 1.66 percentage points higher than those of the baseline model. The experimental results show that the proposed method effectively improves performance on the biomedical named entity recognition task.

Key words: biomedicine, named entity recognition, Bidirectional Long Short-Term Memory (BiLSTM) network, Graph Convolutional Network (GCN), syntactic dependency parsing, graph attention mechanism
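
The second and third steps of the abstract (turning a dependency parse into an adjacency matrix, then propagating features over it) can be illustrated with a small sketch. This is not the authors' implementation. It uses only the Python standard library, and the sentence, its head indices, the one-hot features, and the function names `build_adjacency`, `gcn_layer`, and `attention_weights` are all hypothetical. The edges are assumed to be undirected with self-loops, the graph convolution uses simple row normalization, and the attention step is a simplified dot-product softmax over syntactic neighbours, standing in for the learned scoring function of a graph attention network.

```python
import math


def build_adjacency(heads):
    """Build a symmetric adjacency matrix with self-loops from head indices.

    heads[i] is the 0-based index of token i's syntactic head, or -1 for the root.
    """
    n = len(heads)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1.0  # self-loop keeps each token's own features
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child][head] = 1.0
            adj[head][child] = 1.0  # assumption: dependency arcs treated as undirected
    return adj


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1 * A * H * W) with row normalization."""
    n, d_in, d_out = len(adj), len(feats[0]), len(weight[0])
    out = []
    for i in range(n):
        degree = sum(adj[i])
        # Average the features of token i and its syntactic neighbours.
        agg = [sum(adj[i][j] * feats[j][k] for j in range(n)) / degree
               for k in range(d_in)]
        # Linear transform followed by ReLU.
        out.append([max(0.0, sum(agg[k] * weight[k][m] for k in range(d_in)))
                    for m in range(d_out)])
    return out


def attention_weights(adj, feats, i):
    """Softmax over dot-product scores between node i and each of its neighbours."""
    neighbours = [j for j in range(len(adj)) if adj[i][j] > 0]
    scores = [sum(a * b for a, b in zip(feats[i], feats[j])) for j in neighbours]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {j: e / total for j, e in zip(neighbours, exps)}


# Hypothetical hand-written parse; a real pipeline would obtain heads from a parser.
tokens = ["IL-2", "gene", "expression", "requires", "NF-kappaB"]
heads = [2, 2, 3, -1, 3]

adj = build_adjacency(heads)
# One-hot vectors stand in for the BiLSTM outputs described in the abstract.
feats = [[1.0 if i == j else 0.0 for j in range(len(tokens))]
         for i in range(len(tokens))]
identity = [row[:] for row in feats]

hidden = gcn_layer(adj, feats, identity)
att = attention_weights(adj, feats, 3)
```

With one-hot features and an identity weight matrix, each row of `hidden` is simply the normalized neighbourhood of that token, which makes it easy to see how syntactic neighbours contribute to a token's representation. In the model described above, the attention weights would replace the uniform averaging so that more relevant neighbours count for more.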
