Document-level relation extraction method based on path labels

doi:10.11772/j.issn.1001-9081.2022030327

Journal of Computer Applications, 2023, Vol. 43, Issue (4): 1029-1035. DOI: 10.11772/j.issn.1001-9081.2022030327

• Artificial Intelligence •

Quan YUAN¹,², Yunpeng XU¹,², Chengliang TANG¹,²

1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
2. Research Center of New Communication Technology Applications, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

Received: 2022-03-17 Revised: 2022-07-09 Accepted: 2022-07-29 Online: 2023-01-11 Published: 2023-04-10
Corresponding author: Yunpeng XU
About the authors: YUAN Quan, born in 1976 in Shaoyang, Hunan, M. S., professor-level senior engineer. His research interests include big data and natural language processing.
TANG Chengliang, born in 1998 in Deyang, Sichuan, M. S. candidate. His research interests include big data and natural language processing.

Abstract

Document-level relation extraction involves highly complex text processing, which makes it difficult to extract entity relations efficiently. To address this problem, a document-level relation extraction method based on path labels was proposed to select the key evidence sentences. Firstly, Path labels were introduced to replace entity sentences, and the resulting text was used as the preprocessed dataset. At the same time, a U-Net model from semantic segmentation was incorporated: the encoding module at the input end was used to capture the context information of document entities, and the image-style U-Net semantic segmentation module was used to capture the global dependencies among entity triples. Finally, a Softmax function was introduced to reduce the noise in text extraction. Theoretical analysis and simulation results show that, compared with the graph neural network-based RoBERTa (Robustly optimized Bidirectional Encoder Representations from Transformers) relation extraction algorithm RoBERTa-ATLOP, Path+U-Net increased the F1-score on the development and test sets of the Document-level Relation Extraction Dataset (DocRED) by 1.31 and 0.54 percentage points respectively, and on the development and test sets of the Chemical-Disease Reactions (CDR) dataset by 1.32 and 1.19 percentage points respectively. Moreover, Path+U-Net has a lower extraction cost on the datasets and a higher extraction accuracy on text, while keeping the correlation between entities consistent with that in the original dataset. Experimental results show that the proposed path label based extraction method can effectively improve the extraction efficiency for long texts.

Key words: relation extraction, relation classification, distant supervision, attention mechanism, semantic segmentation

CLC Number: TP391

Quan YUAN, Yunpeng XU, Chengliang TANG. Document-level relation extraction method based on path labels[J]. Journal of Computer Applications, 2023, 43(4): 1029-1035.

Dataset	Recall/% (Sen=0)	Recall/% (Sen=1)	Recall/% (Sen=2)	Recall/% (Sen=3)	Recall/% (Sen≥4)	Sent
DocRED	2.9	49.7	87.0	94.9	99.1	8.0
SCIERC	1.5	52.5	87.8	96.2	98.8	9.5
CDR	0.0	68.0	86.7	94.5	99.0	4.2
GDA	0.0	66.0	88.4	92.6	98.4	11.4

Path type	Recall/%	Sent	Path
Adjacent path	74.3	2.71	2.11
Continuation path	32.5	3.24	2.40
Adjacent + continuation paths	81.3	2.85	2.46
Adjacent + continuation + comprehensive paths	88.5	2.76	2.47
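
To make the idea of selecting evidence sentences through paths more concrete, here is a minimal Python sketch. The path definitions in it are assumptions for illustration only, since this page does not define them: an adjacent path is taken to link a head and a tail entity mentioned in the same or neighbouring sentences, and a continuation path is taken to link them through a bridge entity that shares one sentence with the head and another with the tail. The function names and the `max_gap` parameter are hypothetical.

```python
from itertools import product


def adjacent_paths(head_sents, tail_sents, max_gap=1):
    """Sentence spans where head and tail are at most max_gap sentences apart (assumed definition)."""
    paths = set()
    for h, t in product(head_sents, tail_sents):
        if abs(h - t) <= max_gap:
            paths.add(frozenset(range(min(h, t), max(h, t) + 1)))
    return paths


def continuation_paths(head_sents, tail_sents, other_entities):
    """Two-sentence paths through a bridge entity (assumed definition).

    other_entities maps an entity name to the set of sentence indices that mention it.
    """
    paths = set()
    for bridge_sents in other_entities.values():
        # The bridge must co-occur with the head in one sentence and with the tail in another.
        with_head = head_sents & bridge_sents
        with_tail = tail_sents & bridge_sents
        for h, t in product(with_head, with_tail):
            if h != t:
                paths.add(frozenset({h, t}))
    return paths


def select_evidence(mentions, head, tail):
    """Return the sorted sentence indices covered by any path between head and tail."""
    head_sents, tail_sents = mentions[head], mentions[tail]
    others = {e: s for e, s in mentions.items() if e not in (head, tail)}
    paths = adjacent_paths(head_sents, tail_sents) | continuation_paths(head_sents, tail_sents, others)
    return sorted(set().union(*paths)) if paths else []


# Toy document: entity -> indices of the sentences that mention it.
mentions = {"A": {0}, "B": {0, 4}, "C": {4}}
evidence_ab = select_evidence(mentions, "A", "B")  # A and B share sentence 0
evidence_ac = select_evidence(mentions, "A", "C")  # A and C are linked only through B
```

Only the selected sentences would then be passed to the encoder, which is how a path-based filter can shorten the input for long documents.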

Item	Details
Operating system	Windows 10
CPU	Intel Core i7-11700 @ 2.50 GHz
GPU	GeForce GTX 3080T
Memory	16 GB
Programming language	Python 3.6
Development platform	PyTorch 1.2.0
Development tool	PyCharm 2021.2

Dataset	Statistic	Development set	Training set	Test set
DocRED	Documents	1 061	2 127	1 560
	Sentences	9 482	17 346	12 173
	Entity pairs	4 164	9 637	5 807
	Relation types	97	97	97
CDR	Documents	1 298	4 620	3 614
	Sentences	11 640	31 850	27 578
	Entity pairs	6 843	14 662	11 339
	Relation types	120	120	120

Model	Dev IgnF₁	Dev F₁	Test IgnF₁	Test F₁
BERT-GEDA	54.52	56.16	53.71	55.74
CorefBERT	55.32	57.51	54.54	56.96
HIN-BERT	54.29	56.31	53.70	55.60
GAIN-BERT	59.14	61.22	59.00	61.24
BERT-ATLOP	59.22	61.09	59.31	61.30
BERT+Path+U-Net	60.56	62.13	60.08	62.06
RoBERTa-ATLOP	61.32	63.18	61.39	63.40
RoBERTa+Path+U-Net	62.14	64.49	61.80	64.06
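
The F₁ and IgnF₁ columns can be illustrated with a short Python sketch. It assumes the convention commonly used with DocRED, in which F₁ is the micro-averaged F₁ over predicted relation triples and IgnF₁ removes correct predictions that already appear in the training set from the precision term. The function names, relation names, and toy triples are hypothetical, and an official scorer may differ in details.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def relation_f1(predicted, gold, train_facts=frozenset()):
    """Return (F1, IgnF1) in percent for sets of (head, relation, tail) triples."""
    correct = predicted & gold
    correct_in_train = correct & train_facts

    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0

    # Assumed "ignore" variant: correct predictions already seen in training
    # are dropped from both the numerator and denominator of precision.
    kept = len(predicted) - len(correct_in_train)
    ign_precision = (len(correct) - len(correct_in_train)) / kept if kept else 0.0

    return 100 * f1_score(precision, recall), 100 * f1_score(ign_precision, recall)


gold = {("A", "born_in", "B"), ("A", "works_for", "C"), ("C", "located_in", "B")}
predicted = {("A", "born_in", "B"), ("A", "works_for", "C"), ("B", "part_of", "C")}
train_facts = {("A", "born_in", "B")}

f1, ign_f1 = relation_f1(predicted, gold, train_facts)
print(f"F1 = {f1:.2f}, IgnF1 = {ign_f1:.2f}")
```

Under this convention IgnF₁ never exceeds F₁, which is consistent with every row of the table above.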

Model	IgnF₁	F₁
Path+U-Net (Context-based)	62.14	64.49
Path+U-Net (Similarity-based)	60.07	61.52
Without balanced Softmax	59.26	60.81
Without U-Net	58.21	60.23
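
As a rough illustration of why an image segmentation network can be applied to this task, the sketch below arranges entity pairs into an N×N grid, so that each cell plays the role of a pixel and a U-Net-style model could look at all pairs at once. It is a minimal standard-library Python sketch under stated assumptions: the toy entity vectors, the dot-product similarity used as the cell feature, and the function name `pair_grid` are hypothetical and are not taken from the paper.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pair_grid(entity_vectors):
    """Build an N x N grid whose cell (i, j) holds a feature for the pair (entity i, entity j).

    The feature here is a single dot-product similarity (an assumption for illustration);
    a real model would typically keep a multi-channel feature vector in each cell.
    """
    n = len(entity_vectors)
    return [[dot(entity_vectors[i], entity_vectors[j]) for j in range(n)] for i in range(n)]


# Toy embeddings for three entities in one document.
entities = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
grid = pair_grid(entities)

for row in grid:
    print(row)
```

A segmentation-style network would then assign a relation label to each cell, much as it assigns a class to each pixel of an image.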
