Journal of Computer Applications: 1-6. DOI: 10.11772/j.issn.1001-9081.2024050667

• Artificial intelligence •

WH-CoT: 6W2H-based chain-of-thought prompting framework on large language models

Mengke CHEN1,2, Yun BIAN1(), Yunhao LIANG1,2, Haiquan WANG1,2   

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, Sichuan 610213, China
    2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2024-05-27 Revised: 2024-06-26 Accepted: 2024-07-01 Online: 2025-01-24 Published: 2024-12-31
  • Contact: Yun BIAN

  • About the authors: CHEN Mengke (born 1998), male, from Bazhong, Sichuan, M.S. candidate; research interests: natural language processing, large language models.
    BIAN Yun (born 1988), female, from Jiuquan, Gansu, engineer, Ph.D. candidate; research interests: natural language processing, large language models.
    LIANG Yunhao (born 2000), male, from Chengdu, Sichuan, M.S. candidate; research interests: natural language processing, large language models.
    WANG Haiquan (born 1999), male, from Jiaozuo, Henan, M.S. candidate; research interests: natural language processing, large language models.

Abstract:

Concerning the limitations of Chain-of-Thought (CoT) prompting, such as insufficient integration of human strategies and poor performance on small-scale Large Language Models (LLMs), a CoT prompting framework based on the 6W2H (Why, What, Which, When, Where, Who, How, How much) problem decomposition strategy, WH-CoT (6W2H Chain-of-Thought), was proposed. Firstly, the task dataset was clustered and sampled using the Sentence-BERT model and divided into training and test sets. Then, in the training set, element extraction, problem decomposition, answer paragraph construction, and answer generation were performed on all samples to form CoTs, thereby constructing a task-specific corpus. Finally, during the reasoning stage, demonstration samples were sampled adaptively from the corpus and added to the prompt, allowing the model to combine the prompt to generate answers to test questions. For the Qwen-turbo model on arithmetic reasoning tasks, the average accuracy of WH-CoT improved by 3.35 and 4.27 percentage points over the mainstream Zero-Shot-CoT and Manual-CoT, respectively; on multi-hop reasoning tasks, the total performance improvement ratio of WH-CoT on EM (Exact Match) was 36 and 111 percentage points higher than those of Zero-Shot-CoT and Manual-CoT, respectively. In addition, for the small and medium-scale Qwen-14B-Chat and Qwen-7B-Chat models, the total performance improvement ratios of WH-CoT were higher than those of Zero-Shot-CoT and Manual-CoT on both EM and F1. It can be seen that by further integrating human strategies with machine intelligence, WH-CoT effectively improves the reasoning performance of LLMs of different sizes on both arithmetic reasoning and multi-hop reasoning tasks.

Key words: 6W2H(Why, What, Which, When, Where, Who, How, How much), Chain-of-Thought (CoT) prompting, prompt learning, Large Language Model (LLM), In-Context Learning (ICL), adaptive sampling
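The inference stage described in the abstract, in which demonstration samples are adaptively drawn from the task corpus by similarity to the test question and assembled into a prompt, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bag-of-characters embedding is a stand-in for Sentence-BERT, and the corpus field names, prompt wording, and top-k cosine selection are illustrative assumptions.

```python
import math

def embed(text):
    # Stand-in for a Sentence-BERT embedding: a normalized bag-of-letters
    # vector, used here only so the example runs without external models.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def sample_demonstrations(corpus, question, k=2):
    # Adaptive sampling: pick the k corpus entries whose questions are
    # most similar to the test question (hypothetical selection rule).
    q = embed(question)
    ranked = sorted(corpus, key=lambda it: cosine(embed(it["question"]), q),
                    reverse=True)
    return ranked[:k]

def build_prompt(demos, question):
    # Each demonstration carries its 6W2H-style chain of thought (element
    # extraction -> sub-questions -> answer paragraph -> answer).
    parts = [f"Q: {d['question']}\nCoT: {d['cot']}\nA: {d['answer']}"
             for d in demos]
    parts.append(f"Q: {question}\nCoT:")
    return "\n\n".join(parts)

# Toy corpus standing in for the task-specific corpus built in training.
corpus = [
    {"question": "How many apples are left?",
     "cot": "Count the apples, then subtract those taken.", "answer": "3"},
    {"question": "Where was the treaty signed?",
     "cot": "Recall the location of the signing.", "answer": "Paris"},
]
demos = sample_demonstrations(corpus, "How many oranges are left?", k=1)
prompt = build_prompt(demos, "How many oranges are left?")
```

The prompt ends with an open `CoT:` slot so the model continues with its own chain of thought before emitting the final answer, mirroring the demonstration format.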

