Journal of Computer Applications, 2023, Vol. 43, Issue (8): 2426-2430. DOI: 10.11772/j.issn.1001-9081.2022071004

• Artificial Intelligence •

Relation extraction method based on negative training and transfer learning

Kezheng CHEN1,2, Xiaoran GUO3, Yong ZHONG1,2, Zhenping LI1,2

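  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu 610213
    2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049
    3. School of Mathematics and Computer Science, Northwest Minzu University, Lanzhou 730124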
  • Received: 2022-07-11 Revised: 2022-11-03 Accepted: 2022-11-21 Online: 2023-01-15 Published: 2023-08-10
  • Corresponding author: Yong ZHONG
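  • About the authors: CHEN Kezheng (born 1998), male, from Jining, Shandong, M.S. candidate, CCF member. His research interests include natural language processing and big data.
    GUO Xiaoran (born 1981), female, from Gaocheng, Hebei, associate professor, Ph.D. Her research interests include information extraction and knowledge graphs.
    LI Zhenping (born 1990), male, from Zhengzhou, Henan, Ph.D. candidate. His research interests include natural language processing.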
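  • Funding:
    Science and Technology Achievement Transfer and Transformation Platform Project of Sichuan Province (2020ZHCG0002); Fundamental Research Funds for the Central Universities, Young Teacher Innovation Project (31920210090)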

Relation extraction method based on negative training and transfer learning

Kezheng CHEN1,2, Xiaoran GUO3, Yong ZHONG1,2, Zhenping LI1,2   

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu Sichuan 610213, China
    2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
    3. School of Mathematics and Computer Science, Northwest Minzu University, Lanzhou Gansu 730124, China
  • Received:2022-07-11 Revised:2022-11-03 Accepted:2022-11-21 Online:2023-01-15 Published:2023-08-10
  • Contact: Yong ZHONG
  • About author: CHEN Kezheng, born in 1998, M. S. candidate. His research interests include natural language processing and big data.
    GUO Xiaoran, born in 1981, Ph. D., associate professor. Her research interests include information extraction and knowledge graphs.
    LI Zhenping, born in 1990, Ph. D. candidate. His research interests include natural language processing.
  • Supported by:
    Science and Technology Achievement Transformation Platform Program of Sichuan (2020ZHCG0002); Youth Teacher Innovation Project of Fundamental Research Funds for the Central Universities (31920210090)

Abstract:

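Distant supervision is a commonly used method for automatic data labeling in relation extraction, but it introduces a large amount of noisy data that degrades model performance. To address the noisy-data problem, a relation extraction method based on negative training and transfer learning is proposed. First, a noisy-data recognition model is trained with negative training; then, noisy data are filtered and relabeled according to each sample's predicted probability; finally, transfer learning is used to address the domain shift that arises under distant supervision, further improving the precision and recall of the model's predictions. A relation extraction dataset with ethnic characteristics was built on the basis of Thangka culture. Experimental results show that the proposed method reaches an F1 score of 91.67%, which is 3.95 percentage points higher than that of the SENT (Sentence level distant relation Extraction via Negative Training) method and far higher than those of relation extraction methods based on BERT (Bidirectional Encoder Representations from Transformers), BiLSTM+ATT (Bi-directional Long Short-Term Memory And Attention), and PCNN (Piecewise Convolutional Neural Network).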

Key words: distant supervision, negative training, knowledge graph, relation extraction, transfer learning, natural language processing

Abstract:

In relation extraction tasks, distant supervision is a common method for automatic data labeling. However, this method introduces a large amount of noisy data, which degrades model performance. To solve the noisy-data problem, a relation extraction method based on negative training and transfer learning was proposed. Firstly, a noisy-data recognition model was trained through negative training. Then, the noisy data were filtered and relabeled according to the predicted probability of each sample. Finally, transfer learning was used to solve the domain shift problem that exists in distant supervision tasks, further improving the precision and recall of the model. Based on Thangka culture, a relation extraction dataset with ethnic characteristics was constructed. Experimental results show that the F1 score of the proposed method reaches 91.67%, which is 3.95 percentage points higher than that of the SENT (Sentence level distant relation Extraction via Negative Training) method, and is much higher than those of the relation extraction methods based on BERT (Bidirectional Encoder Representations from Transformers), BiLSTM+ATT (Bi-directional Long Short-Term Memory and Attention), and PCNN (Piecewise Convolutional Neural Network).

Key words: distant supervision, negative training, knowledge graph, relation extraction, transfer learning, natural language processing
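
The first two steps named in the abstract (negative training, then filtering and relabeling by predicted probability) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the two thresholds (`noise_threshold`, `relabel_threshold`), and the keep/relabel/drop rule are assumptions chosen for illustration, and the loss follows the general idea of negative training (penalize the probability of a randomly chosen label that the sample is assumed *not* to have) rather than any formula given in this abstract.

```python
import math
import random


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_training_loss(probs, given_label, rng):
    """Negative-training loss for one sample (illustrative).

    A complementary label is drawn at random from all labels except the
    (possibly noisy) given label, and the loss -log(1 - p_complementary)
    pushes the model away from that complementary label.
    """
    candidates = [k for k in range(len(probs)) if k != given_label]
    complementary = rng.choice(candidates)
    loss = -math.log(1.0 - probs[complementary] + 1e-12)
    return loss, complementary


def filter_and_relabel(probs, given_label,
                       noise_threshold=0.2, relabel_threshold=0.7):
    """Decide what to do with one distantly labeled sample.

    Both thresholds are hypothetical values, not taken from the paper.
    - keep:    the model assigns enough probability to the given label
    - relabel: the given label looks noisy and another label is confident
    - drop:    the given label looks noisy and no label is confident
    """
    if probs[given_label] >= noise_threshold:
        return "keep", given_label
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= relabel_threshold:
        return "relabel", best
    return "drop", None


if __name__ == "__main__":
    rng = random.Random(0)
    probs = softmax([2.0, 0.1, 0.1])  # model strongly prefers label 0
    print(negative_training_loss(probs, given_label=0, rng=rng))
    print(filter_and_relabel(probs, given_label=1))
```

In this sketch a sample whose distant label receives low probability is treated as noise, and it is relabeled only when the model is confident about some other relation; the third step in the abstract (transfer learning to reduce domain shift) is not shown.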

CLC number: