Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (7): 2018-2025. DOI: 10.11772/j.issn.1001-9081.2023071051

• Artificial intelligence

Chinese entity and relation extraction model based on parallel heterogeneous graph and sequential attention mechanism

Dianhui MAO (1, 2), Xuebo LI (1), Junling LIU (1), Denghui ZHANG (1), Wenjing YAN (2)

  1. Beijing Key Laboratory of Big Data Technology for Food Safety (Beijing Technology and Business University), Beijing 100048, China
  2. National Engineering Laboratory for Agri-Product Quality Traceability (Beijing Technology and Business University), Beijing 100048, China
  • Received: 2023-08-03 Revised: 2023-09-16 Accepted: 2023-09-21 Online: 2023-10-26 Published: 2024-07-10
  • Contact: Wenjing YAN
  • About the authors: MAO Dianhui, born in 1979, Ph. D., professor. His research interests include blockchain, smart financial technology, food safety, and deep learning.
    LI Xuebo, born in 1997, M. S. candidate. His research interests include food safety, knowledge graphs, and deep learning.
    LIU Junling, born in 1999, M. S. candidate. Her research interests include food safety, molecular property prediction, and deep learning.
    ZHANG Denghui, born in 2000, M. S. candidate. His research interests include blockchain and deep learning.
    First author contact: YAN Wenjing, born in 1985, Ph. D., associate professor. Her research interests include intelligent processing of biological information, deep learning, and image recognition.
  • Supported by:
    Beijing Municipal Natural Science Foundation (9232005); Project of Beijing Municipal University Teacher Team Construction Support Plan (BPHR20220104)

基于并行异构图和序列注意力机制的中文实体关系抽取模型

毛典辉 (1, 2), 李学博 (1), 刘峻岭 (1), 张登辉 (1), 颜文婧 (2)

  1. 食品安全大数据技术北京市重点实验室(北京工商大学), 北京 100048
  2. 农产品质量安全追溯技术及应用国家工程实验室(北京工商大学), 北京 100048
  • 通讯作者: 颜文婧
  • 作者简介:毛典辉(1979—),男,湖北浠水人,教授,博士,主要研究方向:区块链、智能金融科技和食品安全、深度学习;
    李学博(1997—),男,河南商丘人,硕士研究生,CCF会员,主要研究方向:食品安全、知识图谱、深度学习;
    刘峻岭(1999—),女,河南南阳人,硕士研究生,CCF会员,主要研究方向:食品安全、分子性质预测、深度学习;
    张登辉(2000—),男,黑龙江伊春人,硕士研究生,CCF会员,主要研究方向:区块链、深度学习;
    第一联系人:颜文婧(1985—),女,安徽淮南人,副教授,博士,主要研究方向:生物信息智能处理、深度学习、图像识别。
  • 基金资助:
    北京市自然科学基金资助项目(9232005);北京市属高校教师队伍建设支持计划项目(BPHR20220104)

Abstract:

In recent years, the rapid development of deep learning has driven remarkable progress in entity and relation extraction across many fields. However, the complex syntactic structures and semantic relationships of Chinese text still pose many challenges for Chinese entity and relation extraction, and overlapping triples are among the most important of them. To address overlapping triples in Chinese text, a Hybrid Neural Network Entity and Relation Joint Extraction (HNNERJE) model was proposed. The HNNERJE model fused a sequence attention mechanism and a heterogeneous graph attention mechanism in parallel and combined them through a gated fusion strategy. As a result, the model could capture both the word order information and the entity association information of Chinese text and could adaptively adjust the outputs of the subject and object taggers, thereby effectively solving the overlapping triple problem. In addition, an adversarial training algorithm was introduced to improve the model’s adaptability to unseen samples and noise. Finally, the SHapley Additive exPlanations (SHAP) method was used to explain and analyze the HNNERJE model, revealing the key features on which it relied when extracting entities and relations. The HNNERJE model achieved F1 scores of 92.17%, 93.42%, 47.40%, and 67.98% on the NYT, WebNLG, CMeIE, and DuIE datasets, respectively. The experimental results indicate that the HNNERJE model can transform unstructured text data into structured knowledge representations and effectively extract the valuable information they contain.
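
The abstract does not give the equations of the gated fusion strategy. As a rough illustration only, the sketch below shows one common form of gating between two parallel feature vectors: a sigmoid gate decides, per dimension, how much of the sequence-attention feature and how much of the graph-attention feature to keep. The function name `gated_fusion`, the element-wise (diagonal) gate weights, and the toy vectors are all assumptions made for this example and are not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_seq, h_graph, w_seq, w_graph, bias):
    """Fuse two equally sized feature vectors with an element-wise gate.

    Assumed form (not from the paper):
        gate  = sigmoid(w_seq * h_seq + w_graph * h_graph + bias)
        fused = gate * h_seq + (1 - gate) * h_graph
    """
    fused = []
    for s, g, ws, wg, b in zip(h_seq, h_graph, w_seq, w_graph, bias):
        gate = sigmoid(ws * s + wg * g + b)
        fused.append(gate * s + (1.0 - gate) * g)
    return fused


# Hypothetical features for one token from the two parallel branches.
h_seq = [0.2, -1.0, 0.5]     # sequence-attention branch (word order)
h_graph = [0.8, 0.4, -0.3]   # heterogeneous-graph branch (entity links)
zeros = [0.0, 0.0, 0.0]

# With zero weights every gate is 0.5, so the result is the element-wise mean.
fused = gated_fusion(h_seq, h_graph, zeros, zeros, zeros)
print(fused)
```

In a trained model the gate parameters would be learned, so the mix between word order information and entity association information could differ for each token and each dimension.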

Key words: entity and relation extraction, heterogeneous graph, attention mechanism, adversarial training, SHapley Additive exPlanations (SHAP) method

摘要:

近年来,随着深度学习技术的快速发展,实体关系抽取在许多领域取得了显著的进展。然而,由于汉语具有复杂的句法结构和语义关系,面向中文的实体关系抽取任务中仍然存在着多项挑战。其中,中文文本中的重叠三元组问题是领域中的重要难题之一。针对中文文本中的重叠三元组问题,提出了一种混合神经网络实体关系联合抽取(HNNERJE)模型。HNNERJE模型以并行方式融合序列注意力机制和异构图注意力机制,并结合门控融合策略构建了深度集成框架。该模型不仅可以同时捕获中文文本的语序信息和实体关联信息,还能够自适应地调整主客体标记器的输出,从而有效解决重叠三元组问题。另外,通过引入对抗训练算法提高模型对未见样本和噪声的适应能力。运用SHAP(SHapley Additive exPlanations)方法对HNNERJE模型进行解释分析,基于模型的识别结果解析它在抽取实体和关系时所依据的关键特征。HNNERJE模型在NYT、WebNLG、CMeIE和DuIE数据集上的F1值分别达到了92.17%、93.42%、47.40%和67.98%。实验结果表明:HNNERJE模型可以将非结构化的文本数据转化为结构化的知识表示,有效提取其中蕴含的有价值信息。
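
To make the terms "overlapping triple" and "F1 score" in the abstract concrete, here is a small sketch under stated assumptions. It uses the categories commonly used in the literature, entity pair overlap (EPO, two triples share both entities) and single entity overlap (SEO, two triples share one entity), and it scores predictions by exact match of (subject, relation, object). The abstract does not state the matching criterion used in the experiments, so the exact-match rule, the function names, and the toy triples are all illustrative assumptions.

```python
from itertools import combinations


def overlap_types(triples):
    """Return the overlap categories found among (subject, relation, object) triples."""
    kinds = set()
    for (s1, _, o1), (s2, _, o2) in combinations(triples, 2):
        if {s1, o1} == {s2, o2}:
            kinds.add("EPO")   # same entity pair, different relation
        elif {s1, o1} & {s2, o2}:
            kinds.add("SEO")   # exactly one entity in common
    return kinds or {"Normal"}


def triple_prf(predicted, gold):
    """Precision, recall, and F1 over exactly matching triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical gold annotation for one sentence.
gold = [
    ("Beijing", "capital_of", "China"),
    ("Beijing", "located_in", "China"),   # EPO with the triple above
    ("China", "contains", "Shanghai"),    # SEO with both triples above
]
# Hypothetical model output: two correct triples and one wrong one.
pred = [
    ("Beijing", "capital_of", "China"),
    ("China", "contains", "Shanghai"),
    ("Shanghai", "capital_of", "China"),
]

print(overlap_types(gold))
print(triple_prf(pred, gold))
```

A tagger that can emit only one relation per entity pair would miss the second triple in `gold`, which is why overlapping triples are singled out as a difficulty.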

关键词: 实体关系抽取, 异构图, 注意力机制, 对抗训练, SHAP方法

CLC Number: