Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (7): 2026-2033. DOI: 10.11772/j.issn.1001-9081.2023070943

• Artificial Intelligence •

Triplet extraction method for the coal mine electromechanical equipment domain

YOU Xindong, WEN Yingzi, SHE Xinpeng, LYU Xueqiang

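  1. Beijing Key Laboratory of Network Culture and Digital Communication (Beijing Information Science and Technology University), Beijing 100101
  • Received: 2023-07-14 Revised: 2023-09-14 Accepted: 2023-09-20 Online: 2023-10-26 Published: 2024-07-10
  • Corresponding author: LYU Xueqiang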
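  • About the authors: YOU Xindong (born 1979), female, from Yongding, Fujian; professor, Ph.D., CCF member. Main research interests: natural language processing and Chinese information processing.
    WEN Yingzi (born 2001), female, from Luohe, Henan; M.S. candidate. Main research interests: natural language processing and knowledge graphs.
    SHE Xinpeng (born 1998), male, from Putian, Fujian; M.S. candidate. Main research interests: natural language processing and knowledge graphs.
    First contact: LYU Xueqiang (born 1970), male, from Fushun, Liaoning; professor, Ph.D., CCF senior member. Main research interests: Chinese information processing and multimedia information processing.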
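  • Supported by:
    Language Commission Project of China (ZDI145-10); Natural Science Foundation of Beijing (4212020); Huaneng Group Headquarters Science and Technology Project (HNKJ21-HF43)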

Triplet extraction method for mine electromechanical equipment field

Xindong YOU, Yingzi WEN, Xinpeng SHE, Xueqiang LYU

  1. Beijing Key Laboratory of Network Culture and Digital Communication (Beijing Information Science and Technology University), Beijing 100101, China
  • Received: 2023-07-14 Revised: 2023-09-14 Accepted: 2023-09-20 Online: 2023-10-26 Published: 2024-07-10
  • Contact: Xueqiang LYU
  • About the authors: YOU Xindong, born in 1979, Ph.D., professor. Her research interests include natural language processing and Chinese information processing.
    WEN Yingzi, born in 2001, M.S. candidate. Her research interests include natural language processing and knowledge graphs.
    SHE Xinpeng, born in 1998, M.S. candidate. His research interests include natural language processing and knowledge graphs.
    First author contact: LYU Xueqiang, born in 1970, Ph.D., professor. His research interests include Chinese information processing and multimedia information processing.
  • Supported by:
    Language Commission Project of China (ZDI145-10); Natural Science Foundation of Beijing (4212020); Huaneng Group Headquarters Science and Technology Project (HNKJ21-HF43)

Abstract:

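To address the scarcity of corpora for the electromechanical equipment domain, the insufficient mining of relation-type features, and the presence of overlapping triplets in text, a triplet extraction method named TBPA (Triplet extraction Based on Prompt and Antagonistic training) is proposed, which fuses prompt learning and prior knowledge with iterative adversarial training. First, the BERT (Bidirectional Encoder Representations from Transformers) model is fine-tuned on a self-constructed corpus to obtain feature vectors for the input text. Next, the Projected Gradient Descent (PGD) method is used for iterative adversarial training at the embedding layer, improving the model's resistance to perturbed samples and its generalization to real samples. Then, a single-layer head-tail pointer network identifies the head entities, a prompt learning template supplies the domain prior features for each head entity, and the character vectors are combined with the prompt vectors predicted from the prompt template. Finally, under a hierarchical tagging framework, a single-layer head-tail pointer network identifies, one relation at a time, the tail entities for every predefined relation type. Compared with the baseline model CasRel, TBPA improves precision, recall, and F1 score by 3.10, 6.12, and 4.88 percentage points, respectively. The experimental results show that TBPA has certain advantages in triplet extraction for the coal mine electromechanical equipment domain.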
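The head-tail pointer step described in the abstract can be illustrated with a minimal sketch. It assumes the network has already produced, for each token, one probability of being an entity start and one of being an entity end. The function name `decode_spans`, the 0.5 threshold, the nearest-end pairing rule, and the example sentence are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple


def decode_spans(start_probs: List[float],
                 end_probs: List[float],
                 threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Turn per-token start/end probabilities into (start, end) spans.

    Assumed pairing rule: each predicted start is matched with the
    nearest predicted end at or after it (end index is inclusive).
    """
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans = []
    for s in starts:
        later_ends = [e for e in ends if e >= s]
        if later_ends:  # a start with no following end is dropped
            spans.append((s, later_ends[0]))
    return spans


# Hypothetical tokens and pointer outputs for one sentence.
tokens = ["shearer", "uses", "cutting", "motor"]
start_probs = [0.9, 0.1, 0.8, 0.1]
end_probs = [0.8, 0.0, 0.1, 0.9]

spans = decode_spans(start_probs, end_probs)
entities = [" ".join(tokens[s:e + 1]) for s, e in spans]
print(spans, entities)
```

In a cascade of the kind the abstract outlines, a decoder like this would run once for head entities and then once per predefined relation type for tail entities, which is what lets one head entity take part in several overlapping triplets.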

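Key words: coal mine electromechanical equipment, triplet extraction, prompt learning, iterative adversarial training, self-constructed corpus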

Abstract:

To address the challenges of scarce domain-specific corpora, insufficient feature mining of relation types, and the presence of overlapping triplets in texts in the electromechanical equipment domain, a triplet extraction method named TBPA (Triplet extraction Based on Prompt and Antagonistic training) was proposed, which combines prompt learning and prior knowledge with iterative adversarial training. Firstly, the BERT (Bidirectional Encoder Representations from Transformers) model was fine-tuned on a self-constructed corpus to obtain feature vectors for the input text. Then, iterative adversarial training with the Projected Gradient Descent (PGD) method was applied at the embedding layer to enhance the model’s resistance to perturbed samples and its generalization ability on real samples. Next, a single-layer head-tail pointer network was used to identify the head entities, and the domain-specific prior features corresponding to each head entity were obtained from a prompt learning template by combining the character vectors with the prompt vectors predicted by the template. Finally, within a hierarchical tagging framework, another single-layer head-tail pointer network was employed to identify, relation by relation, the tail entities associated with all predefined relation types. Compared with the baseline model CasRel, TBPA achieves improvements of 3.10, 6.12, and 4.88 percentage points in precision, recall, and F1 score, respectively. Experimental results demonstrate that TBPA has certain advantages in triplet extraction tasks within the domain of mine electromechanical equipment.
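The iterative adversarial step can be sketched on a toy problem. The example below is not the authors' implementation: a linear classifier with a sigmoid output stands in for BERT, a three-dimensional vector stands in for a token embedding, and the radius `epsilon`, step size `alpha`, and step count `steps` are hypothetical values. It shows only the core PGD loop: take a normalized gradient-ascent step on the perturbation, then project it back into an L2 ball.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def l2_norm(v: List[float]) -> float:
    return math.sqrt(sum(vi * vi for vi in v))


def loss_and_grad(x: List[float], w: List[float], y: int) -> Tuple[float, List[float]]:
    """Binary cross-entropy of sigmoid(w . x) and its gradient w.r.t. x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = sigmoid(z)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad = [(p - y) * wi for wi in w]
    return loss, grad


def pgd_perturbation(x: List[float], w: List[float], y: int,
                     epsilon: float = 1.0, alpha: float = 0.3,
                     steps: int = 3) -> List[float]:
    """Build an embedding perturbation with `steps` PGD iterations."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        _, grad = loss_and_grad(x_adv, w, y)
        g_norm = l2_norm(grad)
        if g_norm == 0.0:
            break
        # Ascent step along the normalized gradient (increases the loss).
        delta = [di + alpha * gi / g_norm for di, gi in zip(delta, grad)]
        # Projection: keep the perturbation inside the L2 ball of radius epsilon.
        d_norm = l2_norm(delta)
        if d_norm > epsilon:
            delta = [epsilon * di / d_norm for di in delta]
    return delta


# Toy "embedding", classifier weights, and label (all hypothetical).
x = [0.5, -0.2, 0.1]
w = [1.0, 2.0, -1.0]
y = 1

delta = pgd_perturbation(x, w, y)
clean_loss, _ = loss_and_grad(x, w, y)
adv_loss, _ = loss_and_grad([xi + di for xi, di in zip(x, delta)], w, y)
print(clean_loss, adv_loss)
```

In adversarial training, the loss on the perturbed embedding is typically backpropagated together with the clean loss before the optimizer step, so the model learns from both the original and the perturbed input.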

Key words: mine electromechanical equipment, triplet extraction, prompt learning, iterative adversarial training, self-constructed corpora

CLC number: