Annotation method for joint extraction of domain-oriented entities and relations

Journal of Computer Applications, 2021, Vol. 41, Issue (10): 2858-2863. DOI: 10.11772/j.issn.1001-9081.2020101678

Special topic: Artificial Intelligence

WU Saisai, LIANG Xiaohe, XIE Nengfu, ZHOU Ailian, HAO Xinning

Agricultural Blockchain Application and Research Laboratory, Agricultural Information Institute of Chinese Academy of Agricultural Sciences, Beijing 100081, China

Received: 2020-10-29 Revised: 2020-12-22 Online: 2021-01-13 Published: 2021-10-10
Corresponding author: LIANG Xiaohe
About the authors: WU Saisai (born 1996), female (Dong ethnic group), from Guilin, Guangxi, M.S. candidate; research interests: agricultural knowledge graphs, intelligent question answering. LIANG Xiaohe (born 1986), female, from Beijing, assistant research fellow, Ph.D.; research interests: agricultural information management. XIE Nengfu (born 1975), male, from Xishui, Hubei, research fellow, Ph.D.; research interests: large-scale agricultural knowledge acquisition. ZHOU Ailian (born 1973), female, from Tianmen, Hubei, associate research fellow, M.S.; research interests: agricultural information management. HAO Xinning (born 1981), female, from Beijing, assistant research fellow, Ph.D.; research interests: data publishing, knowledge management.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (31671588), the National Social Science Foundation of China (20CTQ019), and the Innovation Engineering Project of the Agricultural Information Institute of Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2016-AII).

Abstract: Traditional entity and relation annotation methods suffer from low efficiency, error propagation, and entity redundancy, and corpora in some domains have the characteristic that one entity (the main entity) holds overlapping relations with multiple other entities at the same time. To address this, a new annotation method for the joint extraction of domain entities and relations was proposed. First, the main entity was annotated with a fixed label, and every other entity in the text that has a relation with the main entity was annotated with the type of relation between the two entities. Labeling entities and relations simultaneously in this way saved at least half of the annotation cost. Then, triples were modeled directly instead of modeling entities and relations separately, and the triple data were obtained through label matching and mapping, which alleviated the problems of overlapping relation extraction, entity redundancy, and error propagation. Finally, taking the crop diseases and pests domain as an example, experiments were conducted to test the performance of the Bidirectional Encoder Representations from Transformers (BERT)-Bidirectional Long Short-Term Memory (BiLSTM)+Conditional Random Field (CRF) end-to-end model on a dataset of 1 619 crop disease and pest documents. Experimental results show that the F1 value of this model is 47.83 percentage points higher than that of the pipeline method based on the traditional annotation method+BERT model; compared with joint learning methods based on the new annotation method and classic models such as BiLSTM+CRF and Convolutional Neural Network (CNN)+BiLSTM+CRF, the F1 value of this model is higher by 9.55 percentage points and 10.22 percentage points respectively, which verifies the effectiveness of the proposed annotation method and model.

Key words: vertical field, joint extraction of entities and relations, sequence annotation, end-to-end model
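
The label matching and mapping step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the concrete tag format, so everything here is an assumption made for illustration: a BIO scheme, the fixed main-entity label `ME`, the relation names `symptom` and `damaged_part`, the helper names `extract_spans` and `decode_triples`, and the simplification that each text contains a single main entity. It is not the authors' implementation.

```python
from typing import List, Optional, Tuple

MAIN_LABEL = "ME"  # hypothetical fixed label for the main entity


def extract_spans(tokens: List[str], tags: List[str], sep: str = " ") -> List[Tuple[str, str]]:
    """Collect (label, text) spans from a BIO tag sequence."""
    spans: List[Tuple[str, str]] = []
    label: Optional[str] = None
    buf: List[str] = []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span starts; close the one that is still open.
            if label is not None:
                spans.append((label, sep.join(buf)))
            label, buf = tag[2:], [tok]
        elif tag.startswith("I-") and label == tag[2:]:
            buf.append(tok)
        else:
            # "O" or an inconsistent I- tag closes the open span.
            if label is not None:
                spans.append((label, sep.join(buf)))
            label, buf = None, []
    if label is not None:
        spans.append((label, sep.join(buf)))
    return spans


def decode_triples(tokens: List[str], tags: List[str], sep: str = " ") -> List[Tuple[str, str, str]]:
    """Map tagged spans to (main entity, relation, entity) triples."""
    spans = extract_spans(tokens, tags, sep)
    mains = [text for label, text in spans if label == MAIN_LABEL]
    if not mains:
        return []
    main = mains[0]  # assumption: one main entity per text
    # Every non-main span carries its relation type as its label.
    return [(main, label, text) for label, text in spans if label != MAIN_LABEL]


# Invented example sentence and tags, for illustration only.
tokens = ["wheat", "rust", "causes", "yellow", "stripes", "on", "leaves"]
tags = ["B-ME", "I-ME", "O", "B-symptom", "I-symptom", "O", "B-damaged_part"]
triples = decode_triples(tokens, tags)
```

Because each non-main entity carries its relation type as its label, a single tag sequence yields several triples that share the same main entity, which is how such a scheme can represent overlapping relations without a separate relation classifier. For character-level Chinese text, passing `sep=""` would join characters without spaces.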

CLC number: TP391.1

WU Saisai, LIANG Xiaohe, XIE Nengfu, ZHOU Ailian, HAO Xinning. Annotation method for joint extraction of domain-oriented entities and relations[J]. Journal of Computer Applications, 2021, 41(10): 2858-2863.
