Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (3): 709-714. DOI: 10.11772/j.issn.1001-9081.2024081190

• Frontier Research and Typical Applications of Large Models •


Recognition and optimization of hallucination phenomena in large language models

Jing HE1, Yang SHEN2, Runfeng XIE3

  1. Institute for Advanced Studies in Humanities and Social Sciences, Beihang University, Beijing 100083, China
    2. School of Journalism and Communication, Tsinghua University, Beijing 100084, China
    3. School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China
  • Received: 2024-08-23 Revised: 2024-11-09 Accepted: 2024-11-12 Online: 2024-11-19 Published: 2025-03-10
  • Contact: Yang SHEN
  • About author: HE Jing, born in 1989 in Suining, Sichuan, Ph.D., lecturer. Her research interests include artificial intelligence and big data.
    XIE Runfeng, born in 1999 in Zhangzhou, Fujian, M.S. candidate. His research interests include natural language processing.
  • Supported by:
    2023 Beijing Social Science Fund Program (23XCC020); Open Fund of the Anhui Provincial Philosophy and Social Sciences Key Laboratory of Youth Mental Health and Crisis Intelligent Intervention (SYS2023A07)


Abstract:

To address the problems that Large Language Models (LLMs) may generate hallucinations, which makes them difficult to apply fully across real-life fields, especially the medical field, and that there is no high-quality LLM hallucination evaluation dataset or corresponding assessment of LLM hallucination degree, a method for recognizing and optimizing LLM hallucinations in the medical question answering field was proposed. Firstly, on the basis of the publicly available dataset Huatuo, an LLM hallucination evaluation dataset for the medical question answering field was constructed by combining answers generated by GPT-4 with manual annotation. Secondly, based on the constructed hallucination evaluation dataset, the concept of "hallucination rate" was defined: by designing prompts that require each model under test to answer "yes" or "no", the hallucination degree of each LLM was tested and quantified, and the "YES MAN" hallucination phenomenon of LLMs was discovered. Thirdly, GPT-4, an LLM with a low hallucination rate, was adopted as a LeaderAI to provide prior knowledge that assists LLMs with high hallucination rates in making judgments. Finally, to explore whether multiple different LLMs make mistakes on the same question, the concept of "hallucination collision" was defined, and the hallucination collision situations of different LLMs in the medical question answering field were revealed using probability statistics. Experimental results show that introducing a LeaderAI improves the performance of LLMs with high hallucination rates, enabling them to cope with the "YES MAN" hallucination phenomenon in medical question answering at a low hallucination rate, and that the probability of current LLMs hallucinating simultaneously on the same question (colliding) is low.
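As a concrete illustration of the yes/no evaluation protocol described above, the following Python fragment shows one way a "hallucination rate" could be computed. It is a minimal sketch under stated assumptions: `query_llm` is a hypothetical stand-in for a model API call, and the prompt wording and dataset fields (`question`, `answer`, `is_correct`) are illustrative, not the authors' exact design.

```python
# Minimal sketch of the yes/no "hallucination rate" protocol from the
# abstract. `query_llm` is a hypothetical stand-in for a chat-model API
# call; prompt wording and dataset fields are illustrative assumptions.

def query_llm(model: str, prompt: str) -> str:
    """Hypothetical single-turn call to the model under test."""
    raise NotImplementedError("wire this to a real model API")


def hallucination_rate(model: str, dataset: list[dict]) -> float:
    """Fraction of yes/no judgments that contradict the human annotation.

    Each dataset item is assumed to look like:
        {"question": str, "answer": str, "is_correct": bool}
    where "is_correct" is the manually annotated ground truth.
    """
    errors = 0
    for item in dataset:
        prompt = (
            f"Question: {item['question']}\n"
            f"Answer: {item['answer']}\n"
            "Is this answer factually correct? Reply with yes or no only."
        )
        reply = query_llm(model, prompt).strip().lower()
        if reply.startswith("yes") != item["is_correct"]:
            errors += 1  # the model's judgment disagrees with the annotation
    return errors / len(dataset)
```

Under this scoring, a "YES MAN" model that answers "yes" regardless of content is wrong on exactly the annotated-incorrect answers, so its hallucination rate equals the fraction of incorrect answers in the dataset.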
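The LeaderAI step could then be layered on top of the same protocol: the low-hallucination model's output is prepended as prior knowledge before the weaker model judges. Again a hedged sketch reusing the hypothetical `query_llm` stub above; the hint prompt is an assumption, not the authors' exact wording.

```python
def leader_assisted_judgment(leader: str, model: str, item: dict) -> bool:
    """Sketch of the LeaderAI idea: a low-hallucination model (e.g. GPT-4)
    supplies prior knowledge that the high-hallucination model sees before
    answering. Reuses the hypothetical `query_llm` stub defined above."""
    hint = query_llm(
        leader,
        f"Briefly state the medical facts relevant to: {item['question']}",
    )
    prompt = (
        f"Background knowledge: {hint}\n"
        f"Question: {item['question']}\n"
        f"Answer: {item['answer']}\n"
        "Is this answer factually correct? Reply with yes or no only."
    )
    return query_llm(model, prompt).strip().lower().startswith("yes")
```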
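For "hallucination collision", the abstract does not specify the exact statistic, so the sketch below takes one plausible reading: compare the observed rate at which all models err on the same question against the rate an independence assumption would predict.

```python
import math

def collision_stats(judgment_errors: dict[str, list[bool]]) -> tuple[float, float]:
    """Empirical vs. independence-predicted collision rate.

    `judgment_errors[model][i]` is True when `model` hallucinated on
    question i (per `hallucination_rate`-style scoring above). Returns
    (observed collision rate, rate predicted if models erred independently).
    One plausible reading of the abstract's statistic, not the authors' own.
    """
    models = list(judgment_errors)
    n_questions = len(judgment_errors[models[0]])
    observed = sum(
        all(judgment_errors[m][i] for m in models) for i in range(n_questions)
    ) / n_questions
    predicted = math.prod(
        sum(errs) / n_questions for errs in judgment_errors.values()
    )
    return observed, predicted
```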

Key words: Large Language Model (LLM), hallucination recognition, hallucination rate, hallucination collision, model optimization

CLC Number: