Biomedical named entity recognition with graph network based on syntactic dependency parsing

doi:10.11772/j.issn.1001-9081.2020050738

Abstract

Existing biomedical named entity recognition methods do not use the syntactic information in the corpus, which results in low precision. To solve this problem, a biomedical named entity recognition model with a graph network based on syntactic dependency parsing was proposed. Firstly, a Convolutional Neural Network (CNN) was used to generate character vectors, which were concatenated with word vectors and fed into a Bidirectional Long Short-Term Memory (BiLSTM) network for training. Secondly, syntactic dependency parsing was performed on the corpus sentence by sentence, and the corresponding adjacency matrix was constructed. Finally, the BiLSTM output and the adjacency matrix built from the dependency parse were fed into a Graph Convolutional Network (GCN) for training, and a graph attention mechanism was introduced to optimize the feature weights of adjacent nodes and obtain the model output. On the JNLPBA and NCBI-disease datasets, the proposed model reached F1 scores of 76.91% and 87.80% respectively, which are 2.62 and 1.66 percentage points higher than those of the baseline model. The experimental results show that the proposed method effectively improves model performance on the biomedical named entity recognition task.
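The second and third steps of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_adjacency` and `attention_aggregate`, the symmetric adjacency matrix with self-loops, the dot-product attention scores, and the hand-written dependency heads are all assumptions chosen for illustration. A real graph attention layer learns its scoring function, and the heads would come from a dependency parser.

```python
import math


def build_adjacency(heads):
    """Build an adjacency matrix for one sentence from dependency heads.

    heads[i] is the index of the head of token i, or -1 for the root.
    Assumption: edges are undirected and every token has a self-loop.
    """
    n = len(heads)
    adj = [[0] * n for _ in range(n)]
    for i, h in enumerate(heads):
        adj[i][i] = 1  # self-loop keeps the token's own features
        if h >= 0:
            adj[i][h] = 1  # dependent -> head
            adj[h][i] = 1  # head -> dependent
    return adj


def attention_aggregate(features, adj):
    """Aggregate neighbour features with softmax attention weights.

    Assumption: the attention score between two tokens is the dot product
    of their feature vectors; only tokens linked in adj are attended to.
    """
    out = []
    for i, row in enumerate(adj):
        neighbours = [j for j, edge in enumerate(row) if edge]
        scores = [
            sum(a * b for a, b in zip(features[i], features[j]))
            for j in neighbours
        ]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(features[i])
        out.append([
            sum(w * features[j][d] for w, j in zip(weights, neighbours))
            for d in range(dim)
        ])
    return out


# Hypothetical parse of "IL-2 gene expression requires NF-kappaB":
# IL-2 -> gene -> expression -> requires (root) <- NF-kappaB
heads = [1, 2, 3, -1, 3]
adjacency = build_adjacency(heads)

# Stand-in for the BiLSTM output: one small feature vector per token.
token_features = [[0.1, 0.3], [0.2, 0.1], [0.4, 0.0], [0.3, 0.5], [0.0, 0.2]]
updated = attention_aggregate(token_features, adjacency)
```

Each row of `updated` is a weighted average of the features of a token and its syntactic neighbours, so tokens that are far apart in the sentence but linked in the dependency tree can exchange information in a single step.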

Key words: biomedicine, named entity recognition, Bidirectional Long Short-Term Memory (BiLSTM) network, Graph Convolutional Network (GCN), syntactic dependency parsing, graph attention mechanism

CLC Number: TP391.1

XU Li, LI Jianhua. Biomedical named entity recognition with graph network based on syntactic dependency parsing[J]. Journal of Computer Applications, 2021, 41(2): 357-362.
