Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (10): 2858-2863. DOI: 10.11772/j.issn.1001-9081.2020101678

Special topic: Artificial intelligence

Annotation method for joint extraction of domain-oriented entities and relations

WU Saisai, LIANG Xiaohe, XIE Nengfu, ZHOU Ailian, HAO Xinning

  1. Agricultural Blockchain Application and Research Laboratory, Agricultural Information Institute of Chinese Academy of Agricultural Sciences, Beijing 100081, China
  • Received: 2020-10-29 Revised: 2020-12-22 Online: 2021-10-10 Published: 2021-01-13
  • Corresponding author: LIANG Xiaohe
  • About the authors: WU Saisai (吴赛赛), born in 1996, female (Dong ethnicity), from Guilin, Guangxi, M.S. candidate. Her research interests include agricultural knowledge graphs and intelligent question answering. LIANG Xiaohe (梁晓贺), born in 1986, female, from Beijing, Ph.D., assistant research fellow. Her research interests include agricultural information management. XIE Nengfu (谢能付), born in 1975, male, from Xishui, Hubei, Ph.D., research fellow. His research interests include large-scale agricultural knowledge acquisition. ZHOU Ailian (周爱莲), born in 1973, female, from Tianmen, Hubei, M.S., associate research fellow. Her research interests include agricultural information management. HAO Xinning (郝心宁), born in 1981, female, from Beijing, Ph.D., assistant research fellow. Her research interests include data publishing and knowledge management.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (31671588), the National Social Science Foundation of China (20CTQ019), and the Innovation Engineering Project of Agricultural Information Institute of Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2016-AII).

Abstract: Traditional methods for annotating entities and relations suffer from low efficiency, error propagation, and entity redundancy. In addition, corpora in some domains have the characteristic that one entity (the main entity) has overlapping relations with multiple other entities at the same time. To address these issues, a new annotation method for joint extraction of domain entities and relations was proposed. First, the main entity was labeled with a fixed tag, and every other entity in the text that has a relation with the main entity was labeled with the type of relation between the two entities. Labeling entities and relations simultaneously in this way saved at least half of the annotation cost. Then, triples were modeled directly instead of modeling entities and relations separately, and the triples were obtained through label matching and mapping, which alleviated the problems of overlapping relation extraction, entity redundancy, and error propagation. Finally, experiments were conducted in the domain of crop diseases and pests, and the performance of the end-to-end Bidirectional Encoder Representations from Transformers (BERT)-Bidirectional Long Short-Term Memory (BiLSTM)+Conditional Random Field (CRF) model was tested on a dataset of 1,619 crop disease and pest documents. Experimental results show that the F1 value of this model is 47.83 percentage points higher than that of the pipeline method based on the traditional annotation method and the BERT model. Compared with joint learning methods that combine the new annotation method with classic models such as BiLSTM+CRF and Convolutional Neural Network (CNN)+BiLSTM+CRF, the F1 value of the model is higher by 9.55 and 10.22 percentage points respectively, which verifies the effectiveness of the proposed annotation method and model.

Key words: vertical domain, joint extraction of entities and relations, sequence labeling, end-to-end model
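
The labeling idea summarized in the abstract (a fixed tag for the main entity, relation-type tags for every related entity, and triples recovered by label matching) can be illustrated with a minimal Python sketch. The abstract does not give the actual tag set, so everything named here is an assumption made for illustration: the `ME` tag, the BIO prefixes, the relation names `pathogen` and `damaged_part`, the helper functions, and the simplification that each text has exactly one main entity.

```python
from typing import List, Tuple

MAIN_TAG = "ME"  # hypothetical fixed label for the main entity


def extract_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (label, text) spans from BIO tags such as B-ME or B-pathogen."""
    spans: List[Tuple[str, str]] = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")
        if prefix == "B":
            # A new span starts; close the one that was open, if any.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = name, [token]
        elif prefix == "I" and name == label:
            words.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes the open span.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


def decode_triples(tokens: List[str], tags: List[str]) -> List[Tuple[str, str, str]]:
    """Map tagged spans to (main entity, relation, entity) triples."""
    spans = extract_spans(tokens, tags)
    mains = [text for label, text in spans if label == MAIN_TAG]
    if not mains:
        return []
    main = mains[0]  # assumption: one main entity per text
    # Every non-main span carries its relation type as its label.
    return [(main, label, text) for label, text in spans if label != MAIN_TAG]


tokens = ["Rice", "blast", "is", "caused", "by", "Magnaporthe", "oryzae",
          "and", "damages", "leaves"]
tags = ["B-ME", "I-ME", "O", "O", "O", "B-pathogen", "I-pathogen",
        "O", "O", "B-damaged_part"]

for triple in decode_triples(tokens, tags):
    print(triple)
```

In this sketch a single tag sequence encodes both the entities and their relations to the main entity, so one entity can take part in several triples without a separate relation classification step. A trained sequence labeler would supply `tags`; here they are written by hand.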
