Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (10): 3121-3130. DOI: 10.11772/j.issn.1001-9081.2024101536

• Artificial intelligence

Entity-relation extraction strategy in Chinese open domains based on large language models

Yonggang GONG, Shuhan CHEN, Xiaoqin LIAN, Qiansheng LI, Hongming MO, Hongyu LIU

  1. School of Computer and Artificial Intelligence, Beijing Technology and Business University, Beijing 100048, China
  • Received:2024-10-30 Revised:2025-03-20 Accepted:2025-03-27 Online:2025-04-21 Published:2025-10-10
  • Contact: Shuhan CHEN
  • About author:GONG Yonggang, born in 1973 in Luoyang, Henan, Ph.D., associate professor. His research interests include large language models, the internet of things, and natural language processing.
    CHEN Shuhan, born in 1999 in Beijing, M.S. candidate. His research interests include large language models and natural language processing. E-mail: 2836693196@qq.com
    LIAN Xiaoqin, born in 1967 in Qinyang, Henan, Ph.D., professor. Her research interests include computer measurement and control.
    LI Qiansheng, born in 2000 in Mianyang, Sichuan, M.S. candidate. His research interests include natural language processing.
    MO Hongming, born in 2000 in Beijing, M.S. candidate. His research interests include natural language processing.
    LIU Hongyu, born in 1998 in Beijing, M.S. candidate. His research interests include natural language processing.
  • Supported by:
    2024 Postgraduate Education and Teaching Achievement Cultivation Project of Beijing Technology and Business University (19008024042)

Abstract:

Large Language Models (LLMs) suffer from unstable extraction performance on Entity-Relation Extraction (ERE) tasks in Chinese open domains, and their precision in recognizing texts and annotation categories in certain specialized fields is low. To address this, a Chinese open-domain entity-relation extraction strategy based on LLMs, called the Multi-Level Dialog Strategy for Large Language Models (MLDS-LLM), was proposed. The strategy exploits the strong semantic understanding and transfer-learning capabilities of LLMs to perform entity-relation extraction through multi-turn dialogs over different tasks. First, structured summaries were generated by the LLM on the basis of the structural logic of open-domain text and a Chain-of-Thought (CoT) mechanism, which avoids relational and factual hallucinations as well as the model's inability to take subsequent context into account. Next, the limitation of the context window was mitigated through a text simplification strategy and the introduction of a replaceable vocabulary. Finally, multi-level prompt templates were constructed from the structured summaries and simplified texts; the influence of the temperature parameter on ERE was explored with the LLaMA-2-70B model, and the Precision, Recall, F1 score, and Exact Match (EM) values of entity-relation extraction by LLaMA-2-70B were measured before and after applying the proposed strategy. Experimental results demonstrate that the strategy improves the performance of the LLM on Named Entity Recognition (NER) and Relation Extraction (RE) across five Chinese datasets from different domains, including CL-NE-DS, DiaKG, and CCKS2021. In particular, on the highly specialized DiaKG and IEPA datasets, where the model's zero-shot results are poor, the strategy improves NER precision by 9.3 and 6.7 percentage points and EM by 2.7 and 2.2 percentage points, respectively, and improves RE precision by 12.2 and 16.0 percentage points and F1 by 10.7 and 10.0 percentage points, respectively, compared with few-shot prompting. These results confirm that the proposed strategy effectively improves LLM entity-relation extraction and alleviates the instability of model performance.
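
As a concrete illustration of the multi-level dialog flow summarized above, the sketch below mocks the three stages (CoT-guided structured summarization, text simplification with a replaceable vocabulary, and templated triple extraction) as successive turns of one conversation. The `chat` callable, the prompt wording, and the `REPLACEABLE_VOCAB` entries are hypothetical stand-ins, not the authors' actual templates or vocabulary.

```python
# Minimal sketch of a multi-level dialog pipeline for ERE, assuming a generic
# chat() wrapper around an LLM endpoint; all prompts and the vocabulary below
# are illustrative placeholders, not the paper's actual templates.
from typing import Callable

# Hypothetical replaceable vocabulary: long domain phrases are swapped for
# short aliases so that more source text fits inside the context window.
REPLACEABLE_VOCAB = {
    "2型糖尿病": "T2D",   # "type 2 diabetes" -> short alias
    "胰岛素抵抗": "IR",   # "insulin resistance" -> short alias
}

def simplify(text: str) -> str:
    """Stage 2: shrink the input by substituting aliases from the vocabulary."""
    for phrase, alias in REPLACEABLE_VOCAB.items():
        text = text.replace(phrase, alias)
    return text

def mlds_extract(text: str, chat: Callable[[list[dict]], str]) -> str:
    """Run the three dialog levels in one conversation; return raw triples."""
    history: list[dict] = []

    def turn(prompt: str) -> str:
        history.append({"role": "user", "content": prompt})
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
        return reply

    # Stage 1: CoT-guided structured summary of the whole passage, so later
    # turns can also use information that appears after the current sentence.
    summary = turn(
        "Read the passage and think step by step about its structure, "
        "then produce a structured summary of its entities and events.\n"
        + text
    )
    # Stage 2: the simplified text keeps the final prompt within the window.
    short_text = simplify(text)
    # Stage 3: multi-level prompt built from the summary plus simplified text.
    return turn(
        "Using the summary from the previous turn:\n" + summary +
        "\nExtract (head entity, relation, tail entity) triples from:\n"
        + short_text
    )
```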
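The reported Precision, Recall, F1, and EM values can be read as set-level comparisons between predicted and gold triples. The sketch below shows one common way to compute them, assuming EM means the predicted set matches the gold set exactly for a sample; the paper's precise matching criterion is not restated here.

```python
# Hypothetical set-based scoring of predicted vs. gold (head, relation, tail)
# triples; EM here is 1.0 when the whole prediction set equals the gold set.
def score(pred: set[tuple], gold: set[tuple]) -> dict[str, float]:
    tp = len(pred & gold)                       # correctly extracted triples
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    em = float(pred == gold)                    # exact match for one sample
    return {"P": precision, "R": recall, "F1": f1, "EM": em}

# Example: one of two gold triples found -> P=1.0, R=0.5, F1≈0.667, EM=0.0.
print(score({("糖尿病", "症状", "多饮")},
            {("糖尿病", "症状", "多饮"), ("糖尿病", "并发症", "视网膜病变")}))
```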

Key words: Large Language Model (LLM), Chinese open-domain, Named Entity Recognition (NER), Relation Extraction (RE), prompt learning

CLC Number: