[1] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 1877-1901.
[2] CHOWDHERY A, NARANG S, DEVLIN J, et al. PaLM: scaling language modeling with pathways[J]. Journal of Machine Learning Research, 2023, 24: 1-113.
[3] WEI J, WANG X, SCHUURMANS D, et al. Chain-of-thought prompting elicits reasoning in large language models[C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 24824-24837.
[4] KOJIMA T, GU S S, REID M, et al. Large language models are zero-shot reasoners[C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 22199-22213.
[5] WANG X, WEI J, SCHUURMANS D, et al. Self-consistency improves chain of thought reasoning in language models[EB/OL]. [2024-01-11]. https://arxiv.org/abs/2203.11171.
[6] WANG J, LI J, ZHAO H. Self-prompted chain-of-thought on large language models for open-domain multi-hop reasoning[C]// Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg: ACL, 2023: 2717-2731.
[7] WEI J, TAY Y, BOMMASANI R, et al. Emergent abilities of large language models[EB/OL]. [2024-03-27]. https://arxiv.org/abs/2206.07682.
[8] COBBE K, KOSARAJU V, BAVARIAN M, et al. Training verifiers to solve math word problems[EB/OL]. [2024-03-05]. https://arxiv.org/abs/2110.14168.
[9] LING W, YOGATAMA D, DYER C, et al. Program induction by rationale generation: learning to solve and explain algebraic word problems[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2017: 158-167.
[10] PATEL A, BHATTAMISHRA S, GOYAL N. Are NLP models really able to solve simple math word problems?[C]// Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: ACL, 2021: 2080-2094.
[11] TRIVEDI H, BALASUBRAMANIAN N, KHOT T, et al. MuSiQue: multi-hop questions via single-hop question composition[J]. Transactions of the Association for Computational Linguistics, 2022, 10: 539-554.
[12] YANG Z, QI P, ZHANG S, et al. HotpotQA: a dataset for diverse, explainable multi-hop question answering[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2018: 2369-2380.
[13] HO X, DUONG NGUYEN A K, SUGAWARA S, et al. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps[C]// Proceedings of the 28th International Conference on Computational Linguistics. [S.l.]: International Committee on Computational Linguistics, 2020: 6609-6625.
[14] 大岛祥誉. McKinsey thinking tools[M]. 朱悦玮, trans. Beijing: 北京时代华文书局, 2023: 154-157.
[15] CHAKMA K, DAS A, DEBBARMA S. Deep semantic role labeling for tweets using 5W1H: Who, What, When, Where, Why and How[J]. Computación y Sistemas, 2019, 23(3): 751-763.
[16] 姜天笑. A brief discussion of the 5W1H analysis method in sci-tech novelty search[J]. 情报探索, 2011(5): 96-97.
[17] CHAKMA K, DAS A. A 5W1H based annotation scheme for semantic role labeling of English tweets[J]. Computación y Sistemas, 2018, 22(3): 747-755.
[18] DAS A, BANDYOPADHYAY S, GAMBÄCK B. The 5W structure for sentiment summarization-visualization-tracking[C]// Proceedings of the 2012 International Conference on Computational Linguistics and Intelligent Text Processing, LNCS 7181. Berlin: Springer, 2012: 540-555.
[19] PARTON K, McKEOWN K R, COYNE R, et al. Who, what, when, where, why? Comparing multiple approaches to the cross-lingual 5W task[C]// Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Stroudsburg: ACL, 2009: 423-431.
[20] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
[21] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg: ACL, 2019: 4171-4186.
[22] LIU P, YUAN W, FU J, et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing[J]. ACM Computing Surveys, 2023, 55(9): Article No. 195.
[23] RAE J W, BORGEAUD S, CAI T, et al. Scaling language models: methods, analysis & insights from training Gopher[EB/OL]. [2024-02-21]. https://arxiv.org/abs/2112.11446.
[24] ZELIKMAN E, WU Y, MU J, et al. STaR: self-taught reasoner bootstrapping reasoning with reasoning[C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 15476-15488.
[25] PRESS O, ZHANG M, MIN S, et al. Measuring and narrowing the compositionality gap in language models[C]// Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg: ACL, 2023: 5687-5711.
[26] ZHANG Z, ZHANG A, LI M, et al. Automatic chain of thought prompting in large language models[EB/OL]. [2024-01-06]. https://arxiv.org/abs/2210.03493.
[27] AGRAWAL S, ZHOU C, LEWIS M, et al. In-context examples selection for machine translation[C]// Findings of the Association for Computational Linguistics: ACL 2023. Stroudsburg: ACL, 2023: 8857-8873.
[28] YE J, WU Z, FENG J, et al. Compositional exemplars for in-context learning[C]// Proceedings of the 40th International Conference on Machine Learning. New York: JMLR.org, 2023: 39818-39833.
[29] REIMERS N, GUREVYCH I. Sentence-BERT: sentence embeddings using Siamese BERT-networks[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg: ACL, 2019: 3982-3992.
[30] ZHOU D, SCHÄRLI N, HOU L, et al. Least-to-most prompting enables complex reasoning in large language models[EB/OL]. [2024-03-28]. https://arxiv.org/abs/2205.10625.
[31] BAI J, BAI S, CHU Y, et al. Qwen technical report[EB/OL]. [2024-02-28]. https://arxiv.org/abs/2309.16609.