Journal of Computer Applications, 2024, Vol. 44, Issue (2): 393-402. DOI: 10.11772/j.issn.1001-9081.2023020143

• Artificial intelligence

High-precision entity and relation extraction in medical domain based on pseudo-entity data augmentation

Andi GUO1, Zhen JIA1, Tianrui LI1,2

  1. School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu Sichuan 611756, China
  2. National Engineering Laboratory of Integrated Transportation Big Data Application Technology (Southwest Jiaotong University), Chengdu Sichuan 611756, China
  • Received: 2023-02-20 Revised: 2023-03-21 Accepted: 2023-04-03 Online: 2023-08-14 Published: 2024-02-10
  • Contact: Tianrui LI
  • About author: GUO Andi, born in 1998, M. S. candidate. His research interests include natural language processing and knowledge graphs.
    JIA Zhen, born in 1975, Ph. D., lecturer. Her research interests include intelligent question answering and knowledge graphs.
  • Supported by:
    National Natural Science Foundation of China (62276218)

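High-precision medical-domain entity and relation extraction based on pseudo-entity data augmentation (基于伪实体数据增强的高精准率医学领域实体关系抽取)

GUO Andi (郭安迪)1, JIA Zhen (贾真)1, LI Tianrui (李天瑞)1,2

  1. School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756
  2. National Engineering Laboratory of Integrated Transportation Big Data Application Technology (Southwest Jiaotong University), Chengdu 611756
  • Corresponding author: LI Tianrui
  • About author: GUO Andi (born 1998), male, from Heze, Shandong, M. S. candidate, CCF student member. His research interests include natural language processing and knowledge graphs.
    JIA Zhen (born 1975), female, from Baoding, Hebei, lecturer, Ph. D., CCF member. Her research interests include intelligent question answering and knowledge graphs.
  • Supported by:
    National Natural Science Foundation of China (62276218)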

Abstract:

To address the dense knowledge of the medical domain and the propagation of errors between entity extraction and relation classification, a high-precision entity and relation extraction framework based on pseudo-entity data augmentation was proposed. First, a Transformer-based feature reading unit was added to the entity extraction module to capture category information, so that long medical entities can be identified accurately among dense entities. Second, a relation negative example generation module was inserted into the pipeline extraction framework: an under-sampling-based pseudo-entity generation model produced pseudo-entities that confuse the relation classification model, and three data augmentation generation strategies were proposed to improve the model's ability to identify subject-object reversal, subject-object boundary errors, and relation classification errors. Finally, a levitated-marker-based relation classification model was used to alleviate the sharp increase in training time caused by data augmentation. On the CMeIE dataset, the proposed model was compared with four mainstream models. For entity extraction, the proposed model improved the F1 value by 2.26% over the second-best model, PL-Marker (Packed Levitated Marker); for entity relation extraction, it improved the F1 value by 5.45% and the precision by 15.62% over the second-best model, the pipeline extraction model proposed by CBLUE (Chinese Biomedical Language Understanding Evaluation). The experimental results show that using both the feature reading unit and the pseudo-entity data augmentation module can effectively improve the precision of extraction.
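
The three negative example strategies named in the abstract (subject-object reversal, subject-object boundary errors, and relation classification errors) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract describes a learned, under-sampling-based pseudo-entity generation model, whereas the sketch below swaps in simple deterministic perturbations of gold triples. The names `Triple`, `Example`, `augment`, the character-offset span convention, and the relation labels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Hypothetical convention: a span is a [start, end) pair of character offsets.
Span = Tuple[int, int]


@dataclass(frozen=True)
class Triple:
    subject: Span
    relation: str
    obj: Span


@dataclass(frozen=True)
class Example:
    triple: Triple
    is_valid: bool  # True for gold triples, False for generated negatives


def reversed_negative(t: Triple) -> Triple:
    """Strategy 1 (assumed form): swap subject and object."""
    return Triple(subject=t.obj, relation=t.relation, obj=t.subject)


def _shift_end(span: Span, delta: int, text_len: int) -> Optional[Span]:
    """Move the end offset by delta; return None if the span becomes invalid."""
    start, end = span
    new_end = end + delta
    if start < new_end <= text_len:
        return (start, new_end)
    return None


def boundary_negatives(t: Triple, text_len: int) -> List[Triple]:
    """Strategy 2 (assumed form): shrink or extend one entity boundary by one character."""
    negatives = []
    for delta in (-1, 1):
        subj = _shift_end(t.subject, delta, text_len)
        if subj is not None:
            negatives.append(Triple(subj, t.relation, t.obj))
        obj = _shift_end(t.obj, delta, text_len)
        if obj is not None:
            negatives.append(Triple(t.subject, t.relation, obj))
    return negatives


def relation_negatives(t: Triple, labels: Sequence[str]) -> List[Triple]:
    """Strategy 3 (assumed form): keep both entities, replace the relation label."""
    return [Triple(t.subject, r, t.obj) for r in labels if r != t.relation]


def augment(gold: Sequence[Triple], text_len: int, labels: Sequence[str]) -> List[Example]:
    """Return gold triples as positives plus de-duplicated generated negatives."""
    examples = [Example(t, True) for t in gold]
    seen = set(gold)  # never emit a negative that coincides with a gold triple
    for t in gold:
        candidates = (
            [reversed_negative(t)]
            + boundary_negatives(t, text_len)
            + relation_negatives(t, labels)
        )
        for c in candidates:
            if c not in seen:
                seen.add(c)
                examples.append(Example(c, False))
    return examples


if __name__ == "__main__":
    text = "Aspirin treats headache"
    gold = [Triple((0, 7), "drug_treatment", (15, 23))]
    for ex in augment(gold, len(text), ["drug_treatment", "complication"]):
        print(ex)
```

In this simplified setting each negative differs from a gold triple in exactly one respect, so a relation classifier trained on the mixed set sees the specific error types it is expected to reject. How the published framework samples, weights, or labels such negatives is not stated in the abstract and is left open here.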

Key words: entity and relation extraction, data augmentation, high-precision, medical domain, relation negative example generation

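Abstract:

To address the knowledge density of the medical domain and the error propagation between entity extraction and relation classification, a high-precision entity and relation extraction framework based on pseudo-entity data augmentation is proposed. First, a Transformer-based feature reading unit is added to the entity extraction module to capture category information, so that long medical entities are identified accurately among dense entities. Second, a relation negative example generation module is inserted into the pipeline extraction framework; an under-sampling-based pseudo-entity generation model generates pseudo-entities that confuse the relation classification model, and three data augmentation generation strategies improve the model's ability to discriminate subject-object reversal, subject-object boundary errors, and relation classification errors. Finally, a levitated-marker-based relation classification model alleviates the sharp increase in training time brought by data augmentation. On the CMeIE dataset, four current mainstream models were compared. For entity extraction, the F1 value improved by 2.26% over the second-best model, PL-Marker (Packed Levitated Marker); for entity and relation extraction, the F1 value improved by 5.45% and the precision by 15.62% over the second-best model, the pipeline extraction model proposed by CBLUE (Chinese Biomedical Language Understanding Evaluation). The experimental results show that using the feature reading unit and the pseudo-entity data augmentation module can effectively improve extraction precision.

Key words: entity and relation extraction, data augmentation, high-precision, medical domain, relation negative example generation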

CLC Number: