Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (6): 1655-1662. DOI: 10.11772/j.issn.1001-9081.2023060885
• The 38th CCF National Conference of Computer Applications (CCF NCCA 2023) •
Yuemei XU1, Ling HU1, Jiayi ZHAO1, Wanze DU2, Wenqing WANG2
Received: 2023-07-06
Revised: 2023-08-09
Accepted: 2023-08-15
Online: 2023-09-14
Published: 2024-06-10
Contact: Yuemei XU
About author: HU Ling, born in 2000, M.S. candidate. Her research interests include natural language processing.
Abstract: In view of the rapid development of Large Language Model (LLM) technology, analyzing its application prospects and risk challenges is of great reference value for the development and governance of Artificial General Intelligence (AGI). Firstly, taking language models such as Multi-BERT (Multilingual Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer) and ChatGPT (Chat Generative Pre-trained Transformer) as representatives, the development history, core technologies and evaluation systems of LLMs were reviewed. Secondly, the existing technical limitations and security risks of LLMs were analyzed. Finally, suggestions were put forward for LLMs to improve technically and for policy to follow up. The analysis points out that, at their current stage of development, existing LLMs produce untruthful and biased outputs, lack real-time autonomous learning ability, demand enormous computing power, depend heavily on data quality and quantity, and show a monotonous language style; they also carry security risks concerning data privacy, information security and ethics. Future development can continue to improve the technology, moving from "large scale" to "lightweight", from "single modality" to "multi-modality", and from "general purpose" to "vertical domains"; on the policy side, regulation should follow up in real time with targeted measures to standardize the application and development of LLMs.
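As a concrete illustration of the "from large scale to lightweight" direction mentioned in the abstract (model compression via quantization and pruning is covered by references 51-53), the following is a minimal sketch of symmetric int8 post-training weight quantization written in NumPy. It is an illustrative toy example only, not the method of any model or reference cited in this paper, and the 4096 x 4096 weight matrix is hypothetical.

```python
# Toy illustration only: symmetric per-tensor int8 post-training quantization.
# Not the compression method of any model or reference cited in this paper.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 values plus one per-tensor scale."""
    scale = max(float(np.max(np.abs(w))) / 127.0, 1e-12)  # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights for inference."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # hypothetical weight matrix
q, s = quantize_int8(w)
err = float(np.mean(np.abs(dequantize(q, s) - w)))
print(f"fp32: {w.nbytes / 2**20:.0f} MiB, int8: {q.nbytes / 2**20:.0f} MiB, mean abs error: {err:.5f}")
```

Storing weights in int8 cuts memory by roughly a factor of four at the cost of a small reconstruction error, which is the basic trade-off behind the "lightweight" direction; production schemes (per-channel scales, quantization-aware training, structured pruning) are considerably more involved.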
Yuemei XU, Ling HU, Jiayi ZHAO, Wanze DU, Wenqing WANG. Technology application prospects and risk challenges of large language models [J]. Journal of Computer Applications, 2024, 44(6): 1655-1662.
Model | Release year | Maximum parameters | Training data | Model architecture
---|---|---|---|---
GPT | 2018 | ~120 million | BooksCorpus | 12-layer Transformer decoder
GPT-2 | 2019 | ~1.5 billion | WebText (~40 GB of text) | 48-layer Transformer decoder
GPT-3 | 2020 | ~175 billion | Common Crawl, WebText2, Books1, Books2 and Wikipedia (about 500 billion tokens in total) | 96-layer Transformer decoder
LaMDA | 2022 | ~137 billion | Public dialog data and web text, plus human-annotated data (768 billion tokens) | 64-layer Transformer decoder
InstructGPT | 2022 | ~175 billion | Tens of thousands of prompts with annotations of the generated outputs | GPT-3 + RLHF algorithm
PaLM | 2022 | ~540 billion | Webpages, books, Wikipedia, news, GitHub and social media conversations (780 billion tokens) | 118-layer Transformer decoder
GLM-130B | 2022 | ~130 billion | Over 400 billion tokens, roughly 200 billion each in English and Chinese | Transformer decoder
ChatGPT | 2022 | ~175 billion | Additional annotated data (details undisclosed) | GPT-3 + RLHF algorithm
LLaMA | 2023 | ~65 billion | CommonCrawl, C4, GitHub, Wikipedia, books, ArXiv and StackExchange (1.4 trillion tokens) | 80-layer Transformer decoder
GPT-4 | 2023 | — | Text and image data | Transformer decoder
ChatGLM-6B | 2023 | ~6.2 billion | About 1 trillion tokens, roughly 500 billion each in English and Chinese | Transformer decoder
PanGu-Σ | 2023 | ~1.085 trillion | WuDaoCorpora 2.0, Pile dataset, Python and Java code (over 300 billion tokens across 4 main domains) | 40-layer Transformer decoder
Vicuna | 2023 | ~13 billion | Fine-tuned from LLaMA-13B on supervised data (70 thousand user-shared ChatGPT conversations) | 40-layer Transformer decoder
PaLM 2 | 2023 | ~340 billion | Web documents, books, code, mathematics and conversational data, with a higher proportion of non-English data (3.6 trillion tokens) | Transformer decoder
Tab. 1 Model parameters of some representative LLMs
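A quick plausibility check on Table 1: for a decoder-only Transformer, the attention and feed-forward weights alone amount to roughly 12 × L × d² parameters for L layers of hidden width d (embeddings and biases ignored). The sketch below applies this estimate to GPT-2 and GPT-3; the hidden widths 1600 and 12288 are the publicly reported values and are assumptions of this sketch, not columns of the table.

```python
# Back-of-envelope parameter estimate for a decoder-only Transformer:
# each layer holds ~4*d^2 attention weights (Q, K, V and output projections)
# plus ~8*d^2 feed-forward weights (d -> 4d -> d), i.e. ~12*d^2 per layer.
def decoder_params(n_layers: int, d_model: int) -> int:
    attn = 4 * d_model * d_model
    ffn = 8 * d_model * d_model
    return n_layers * (attn + ffn)

# Hidden widths below are publicly reported figures, not values from Table 1.
for name, n_layers, d_model in [("GPT-2", 48, 1600), ("GPT-3", 96, 12288)]:
    print(f"{name}: ~{decoder_params(n_layers, d_model) / 1e9:.1f} billion parameters")
# Prints ~1.5 billion for GPT-2 and ~173.9 billion for GPT-3, close to the
# ~1.5 billion and ~175 billion listed in Table 1.
```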
Name | Evaluation framework | Link
---|---|---
MMLU | Contains 57 tasks, mainly measuring the multi-task accuracy of text models | github.com/hendrycks/test
SuperGLUE | Contains 8 language understanding tasks; the final model score is the weighted sum of the 8 task scores | https://super.gluebenchmark.com/
BIG-Bench | Contains 204 diverse tasks and 4 individual metrics, with different tasks using different metrics | github.com/google/BIG-bench
AGIEval | Evaluates the general ability of foundation models on tasks related to human cognition and problem solving; contains 20 admission and qualification exams designed for human examinees, such as the SAT and LSAT | github.com/microsoft/AGIEval
MMCU (Chinese) | Evaluates the multi-task accuracy of large Chinese language models in medicine, law, psychology and education | github.com/Felixgithub2017/MMCU
C-EVAL (Chinese) | Includes 13 948 multiple-choice questions spanning 52 disciplines and 4 difficulty levels | github.com/SJTU-LIT/ceval
MME | Contains 14 subtasks that evaluate the perception and cognition abilities of multimodal LLMs | github.com/bradyfu/awesome-multimodal-large-language-models
LVLM-eHub | Extensively evaluates 8 Large Vision-Language Models (LVLMs) on 6 categories of multimodal capability through 47 datasets and an online arena platform | github.com/opengvlab/multi-modality-arena
HELM | Adopts a multi-metric approach, evaluating 16 core scenarios with 7 metrics and assessing 30 mainstream LLMs on 42 scenarios | github.com/stanford-crfm/helm
INSTRUCTEVAL | Comprehensively evaluates instruction-tuned LLMs on problem solving, writing ability and alignment with human values | github.com/declare-lab/instruct-eval
G-Eval | Uses scores given by an LLM as the metric for evaluating the output quality of natural language generation tasks | github.com/nlpyang/geval
M3KE (Chinese) | Includes 20 477 questions collected from 71 tasks, covering all major levels of the Chinese education system | github.com/tjunlp-lab/m3ke
Tab. 2 Some representative evaluation benchmarks
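Most benchmarks in Table 2 collapse many per-task results into a single headline score. The sketch below shows the two simplest aggregation schemes implied by the table: a plain macro-average over tasks, and a weighted sum of task scores (the scheme Table 2 describes for SuperGLUE). All task names, scores and weights are hypothetical.

```python
# Hypothetical per-task accuracies and weights, for illustration only;
# real benchmarks define their own task sets, metrics and weights.
from statistics import mean

task_scores = {"task_a": 0.71, "task_b": 0.64, "task_c": 0.58}
task_weights = {"task_a": 0.5, "task_b": 0.3, "task_c": 0.2}   # weights sum to 1

macro_average = mean(task_scores.values())                      # equal weighting across tasks
weighted_sum = sum(task_scores[t] * task_weights[t] for t in task_scores)
print(f"macro-average: {macro_average:.3f}, weighted sum: {weighted_sum:.3f}")
```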
1 | WEI J, TAY Y, BOMMASANI R, et al. Emergent abilities of large language models [EB/OL]. [2023-03-10]. . |
2 | GOERTZEL B. Artificial general intelligence: concept, state of the art, and future prospects [J]. Journal of Artificial General Intelligence, 2014, 5(1): 1-46. |
3 | OpenAI. ChatGPT plugins [EB/OL]. [2023-05-05]. . |
4 | VAN DIS E A M, BOLLEN J, ZUIDEMA W, et al. ChatGPT: five priorities for research [J]. Nature, 2023, 614(7947): 224-226. |
5 | MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space [EB/OL]. [2023-02-23] . |
6 | PENNINGTON J, SOCHER R, MANNING C. GloVe: global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014: 1532-1543. |
7 | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017:6000-6010. |
8 | DEVLIN J, CHANG M-W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers). Stroudsburg: ACL, 2019: 4171-4186. |
9 | RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training [EB/OL]. [2023-05-30]. . |
10 | RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer [J]. The Journal of Machine Learning Research, 2020, 21(1):5485-5551. |
11 | YANG Z, DAI Z, YANG Y, et al. XLNet: generalized autoregressive pretraining for language understanding [C]// Proceedings of the 33rd Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 5753-5763. |
12 | LIU Y, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach [EB/OL]. [2023-02-23]. . |
13 | LAN Z, CHEN M, GOODMAN S, et al. ALBERT: a lite BERT for self-supervised learning of language representations [EB/OL]. [2023-05-30]. . |
14 | CLARK K, LUONG M-T, LE Q V, et al. ELECTRA: pre-training text encoders as discriminators rather than generators [EB/OL]. [2023-05-30]. . |
15 | RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners [EB/OL]. [2023-05-30]. . |
16 | BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners [C]// Proceedings of the 34th Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 1877-1901. |
17 | OUYANG L, WU J, JIANG X, et al. Training language models to follow instructions with human feedback [EB/OL]. [2023-02-23]. . |
18 | CHEN M, TWOREK J, JUN H, et al. Evaluating large language models trained on code [EB/OL]. [2023-02-23]. . |
19 | OpenAI. GPT-4 technical report [EB/OL]. [2023-06-07]. . |
20 | THOPPILAN R, DE FREITAS D, HALL J, et al. LaMDA: language models for dialog applications [EB/OL]. [2023-06-07]. . |
21 | CHOWDHERY A, NARANG S, DEVLIN J, et al. PaLM: scaling language modeling with pathways [EB/OL]. [2023-06-07]. . |
22 | ANIL R, DAI A M, FIRAT O, et al. PaLM 2 technical report [EB/OL]. [2023-06-07]. . |
23 | TOUVRON H, LAVRIL T, IZACARD G, et al. LLaMA: open and efficient foundation language models [EB/OL]. [2023-06-07]. . |
24 | The Vicuna Team. Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality [EB/OL]. [2023-06-07]. . |
25 | SMITH S, PATWARY M, NORICK B, et al. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model [EB/OL]. [2023-07-05]. . |
26 | ZENG W, REN X, SU T, et al. PanGu‑α: large-scale autoregressive pretrained Chinese language models with auto-parallel computation [EB/OL]. [2023-02-23]. . |
27 | REN X, ZHOU P, MENG X, et al. PanGu‑Σ: towards trillion parameter language model with sparse heterogeneous computing [EB/OL]. [2023-06-07]. . |
28 | DU Z, QIAN Y, LIU X, et al. GLM: general language model pretraining with autoregressive blank infilling [EB/OL]. [2023-07-05]. . |
29 | ZENG A, LIU X, DU Z, et al. GLM-130B: an open bilingual pre-trained model [EB/OL]. [2023-07-05]. . |
30 | XIONG H, WANG S, ZHU Y, et al. DoctorGLM: fine-tuning your Chinese doctor is not a Herculean task [EB/OL]. [2023-07-05]. . |
31 | STIENNON N, OUYANG L, WU J, et al. Learning to summarize with human feedback [C]// Proceedings of the 34th Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 3008-3021. |
32 | WU Z, HU Y, SHI W, et al. Fine-grained human feedback gives better rewards for language model training [EB/OL]. [2023-06-15]. . |
33 | DONG H, XIONG W, GOYAL D, et al. RAFT: reward ranked fine tuning for generative foundation model alignment [EB/OL]. [2023-06-14]. . |
34 | YUAN Z, YUAN H, TAN C, et al. RRHF: rank responses to align language models with human feedback without tears [EB/OL]. [2023-06-14]. . |
35 | RAFAILOV R, SHARMA A, MITCHELL E, et al. Direct preference optimization: your language model is secretly a reward model [EB/OL]. [2023-06-14]. . |
36 | HENDRYCKS D, BURNS C, BASART S, et al. Measuring massive multitask language understanding [C/OL]// Proceedings of the 9th International Conference on Learning Representations. 2021 [2023-05-30]. . |
37 | WANG A, PRUKSACHATKUN Y, NANGIA N, et al. SuperGLUE: a stickier benchmark for general-purpose language understanding systems [C]// Proceedings of the 33rd Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 3261-3275. |
38 | SRIVASTAVA A, RASTOGI A, RAO A, et al. Beyond the imitation game: quantifying and extrapolating the capabilities of language models [EB/OL]. [2023-02-25]. . |
39 | ZHONG W, CUI R, GUO Y, et al. AGIEval: a human-centric benchmark for evaluating foundation models [EB/OL]. [2023-06-27]. . |
40 | ZENG H. Measuring massive multitask Chinese understanding [EB/OL]. [2023-06-27]. . |
41 | HUANG Y, BAI Y, ZHU Z, et al. C-EVAL: a multi-level multi-discipline Chinese evaluation suite for foundation models [EB/OL]. [2023-06-27]. . |
42 | FU C, CHEN P, SHEN Y, et al. MME: a comprehensive evaluation benchmark for multimodal large language models [EB/OL]. [2023-06-27]. . |
43 | XU P, SHAO W, ZHANG K, et al. LVLM-eHub: a comprehensive evaluation benchmark for large vision-language models [EB/OL]. [2023-06-27]. . |
44 | LIANG P, BOMMASANI R, LEE T, et al. Holistic evaluation of language models [EB/OL]. [2023-06-08]. . |
45 | CHIA Y K, HONG P, BING L, et al. INSTRUCTEVAL: towards holistic evaluation of instruction-tuned large language models [EB/OL]. [2023-06-27]. . |
46 | LIU Y, ITER D, XU Y, et al. G-Eval: NLG evaluation using GPT-4 with better human alignment [EB/OL]. (2023-05-23)[2023-06-27]. . |
47 | LIU C, JIN R, REN Y, et al. M3KE: a massive multi-level multi-subject knowledge evaluation benchmark for Chinese large language models [EB/OL]. [2023-06-27]. . |
48 | ROZADO D. The political orientation of the ChatGPT AI system, 2022 [EB/OL]. [2023-03-09]. . |
49 | WEI J, WANG X, SCHUURMANS D, et al. Chain of thought prompting elicits reasoning in large language models [C/OL]//Proceedings of the 36th Conference on Neural Information Processing Systems. 2022[2023-05-30]. . |
50 | KAPLAN J, McCANDLISH S, HENIGHAN T, et al. Scaling laws for neural language models [EB/OL]. [2023-02-23]. . |
51 | TAO C, HOU L, ZHANG W, et al. Compression of generative pre-trained language models via quantization [C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2022: 4821-4836. |
52 | HE Y, ZHANG X, SUN J. Channel pruning for accelerating very deep neural networks[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 1398-1406. |
53 | WEN W, WU C, WANG Y, et al. Learning structured sparsity in deep neural networks [C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2016: 2082-2090. |
54 | HUANG S, DONG L, WANG W, et al. Language is not all you need: aligning perception with language models [EB/OL]. [2023-06-21]. . |
55 | SOLAIMAN I, BRUNDAGE M, CLARK J, et al. Release strategies and the social impacts of language models [EB/OL]. [2023-02-23]. . |
56 | CAO J F. Towards trustworthy AI: governance challenges and responses for generative AI like ChatGPT [J]. Journal of Shanghai University of Political Science and Law (The Rule of Law Forum), 2023, 38(4): 28-42. |
57 | ZHI Z F. Information content governance of large model of generative artificial intelligence [J]. Tribune of Political Science and Law, 2023, 41(4): 34-48. |