[1] 徐月梅, 胡玲, 赵佳艺, 等. 大语言模型与多语言智能的研究进展与启示[J]. 计算机应用, 2023, 43(S2): 1-8.
XU Y M, HU L, ZHAO J Y, et al. Research progress and enlightenment of large language models on multi-lingual intelligence [J]. Journal of Computer Applications, 2023, 43(S2): 1-8.
[2] 陈炫婷, 叶俊杰, 祖璨, 等. GPT系列大语言模型在自然语言处理任务中的鲁棒性[J]. 计算机研究与发展, 2024, 61(5): 1128-1142.
CHEN X T, YE J J, ZU C, et al. Robustness of GPT large language models on natural language processing tasks [J]. Journal of Computer Research and Development, 2024, 61(5): 1128-1142.
[3] 陈璐, 张儒清, 郭嘉丰, 等. 面向文本摘要的反事实纠偏方法[J]. 计算机学报, 2023, 46(11): 2400-2415.
CHEN L, ZHANG R Q, GUO J F, et al. Counterfactual debiasing for text summarization [J]. Chinese Journal of Computers, 2023, 46(11): 2400-2415.
[4] ZHANG S, PAN L, ZHAO J, et al. The knowledge alignment problem: bridging human and external knowledge for large language models [C]// Findings of the Association for Computational Linguistics: ACL 2024. Stroudsburg: ACL, 2024: 2025-2038.
[5] HUANG L, YU W, MA W, et al. A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions [EB/OL]. [2024-05-23].
[6] ZHENG S, HUANG J, CHANG K C C. Why does ChatGPT fall short in providing truthful answers? [EB/OL]. [2024-08-23].
[7] ZHANG M, PRESS O, MERRILL W, et al. How language model hallucinations can snowball [C]// Proceedings of the 41st International Conference on Machine Learning. New York: JMLR.org, 2024: 59670-59684.
[8] LIN S, HILTON J, EVANS O. TruthfulQA: measuring how models mimic human falsehoods [C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2022: 3214-3252.
[9] LI J, CHENG X, ZHAO X, et al. HaluEval: a large-scale hallucination evaluation benchmark for large language models [C]// Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2023: 6449-6464.
[10] MIN S, KRISHNA K, LYU X, et al. FActScore: fine-grained atomic evaluation of factual precision in long form text generation [C]// Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2023: 12076-12100.
[11] LEE N, PING W, XU P, et al. Factuality enhanced language models for open-ended text generation [C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 34586-34599.
[12] MANAKUL P, LIUSIE A, GALES M. SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models [C]// Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2023: 9004-9017.
[13] PENG B, GALLEY M, HE P, et al. Check your facts and try again: improving large language models with external knowledge and automated feedback [EB/OL]. [2024-08-23].
[14] HUANG K H, CHAN H P, JI H. Zero-shot faithful factual error correction [C]// Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2023: 5660-5676.
[15] GOU Z, SHAO Z, GONG Y, et al. CRITIC: large language models can self-correct with tool-interactive critiquing [EB/OL]. [2024-08-24].
[16] LI K, PATEL O, VIÉGAS F, et al. Inference-time intervention: eliciting truthful answers from a language model [C]// Proceedings of the 37th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2023: 41451-41530.
[17] CHEN A, PASUPAT P, SINGH S, et al. PURR: efficiently editing language model hallucinations by denoising language model corruptions [EB/OL]. [2024-08-13].
[18] DU Y, LI S, TORRALBA A, et al. Improving factuality and reasoning in language models through multiagent debate [C]// Proceedings of the 41st International Conference on Machine Learning. New York: JMLR.org, 2024: 11733-11763.
[19] ZHANG Y, LI Y, CUI L, et al. Siren’s song in the AI ocean: a survey on hallucination in large language models [EB/OL]. [2024-07-23].