Journal of Computer Applications (《计算机应用》), 2023, Vol. 43, Issue (4): 1029-1035. DOI: 10.11772/j.issn.1001-9081.2022030327

• Artificial Intelligence •

Document-level relation extraction method based on path labels (基于路径标签的文档级关系抽取方法)

Quan YUAN (袁泉)1,2, Yunpeng XU (徐雲鹏)1,2, Chengliang TANG (唐成亮)1,2

  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  2. Research Center of New Communication Technology Applications, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received: 2022-03-17 Revised: 2022-07-09 Accepted: 2022-07-29 Online: 2023-01-11 Published: 2023-04-10
  • Contact: Yunpeng XU
  • About author: YUAN Quan, born in 1976 in Shaoyang, Hunan, M. S., professor-level senior engineer. His research interests include big data and natural language processing.
    TANG Chengliang, born in 1998 in Deyang, Sichuan, M. S. candidate. His research interests include big data and natural language processing.

Abstract:

Document-level relation extraction involves highly complex text processing, which makes it difficult to extract entity relations efficiently. To address this, a path label based document-level relation extraction method was proposed that selects the key evidence sentences. Firstly, Path labels were introduced to replace the entity sentences, and the resulting processed text dataset was used for data preprocessing. At the same time, the U-Net model from semantic segmentation was incorporated: the encoding module at the input end was used to capture the context information of document entities, and the image-style U-Net semantic segmentation module was used to capture the global dependencies among entity triples. Finally, a Softmax function was introduced to reduce the noise in text extraction. Theoretical analysis and simulation results show that, compared with the graph neural network-based RoBERTa (Robustly optimized Bidirectional Encoder Representations from Transformers) relation extraction algorithm RoBERTa-ATLOP, Path+U-Net improves the F1 scores on the development and test sets of the Document-level Relation Extraction Dataset (DocRED) by 1.31 and 0.54 percentage points respectively, and on the development and test sets of the Chemical-Disease Reactions (CDR) dataset by 1.32 and 1.19 percentage points respectively. Moreover, Path+U-Net has a lower extraction cost on the datasets and a higher text extraction accuracy, while keeping the correlation between entities consistent with that in the original dataset. Experimental results show that the proposed path label based extraction method can effectively improve the extraction efficiency on long texts.

Key words: relation extraction, relation classification, distant supervision, attention mechanism, semantic segmentation
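
The abstract does not spell out how Path labels are built. Purely as an illustration of what selecting key evidence sentences for an entity pair could look like, the sketch below assumes a simple co-occurrence heuristic: use sentences that mention both entities, otherwise sentences linked through a shared bridge entity, otherwise the first mention of each entity. The function name `select_path_sentences`, the `mentions` mapping, and the three-case heuristic are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def select_path_sentences(mentions: Dict[str, Set[int]], head: str, tail: str) -> List[int]:
    """Return sorted sentence indices forming an assumed 'path' between head and tail.

    `mentions` maps each entity name to the set of sentence indices that mention it.
    This heuristic is an assumption for illustration, not the paper's definition.
    """
    head_sents, tail_sents = mentions[head], mentions[tail]

    # Case 1: both entities appear in the same sentence.
    shared = head_sents & tail_sents
    if shared:
        return sorted(shared)

    # Case 2: a bridge entity co-occurs with the head in one sentence
    # and with the tail in another sentence.
    selected: Set[int] = set()
    for bridge, bridge_sents in mentions.items():
        if bridge in (head, tail):
            continue
        with_head = head_sents & bridge_sents
        with_tail = tail_sents & bridge_sents
        if with_head and with_tail:
            selected |= with_head | with_tail
    if selected:
        return sorted(selected)

    # Case 3: fall back to the first mention of each entity.
    return sorted({min(head_sents), min(tail_sents)})
```

Under these assumptions, only the returned sentences would be passed to the encoder in place of the full document, which is one way the input length, and hence the processing cost, could be reduced.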

CLC Number: