Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (6): 1655-1662. DOI: 10.11772/j.issn.1001-9081.2023060885
• The 38th CCF National Conference of Computer Applications (CCF NCCA 2023) •
Yuemei XU1, Ling HU1, Jiayi ZHAO1, Wanze DU2, Wenqing WANG2
Received: 2023-07-06
Revised: 2023-08-09
Accepted: 2023-08-15
Online: 2023-09-14
Published: 2024-06-10
Contact: Yuemei XU
About author: HU Ling, born in 2000, M.S. candidate. Her research interests include natural language processing.
Abstract: In view of the rapid development of Large Language Model (LLM) technology, analyzing its application prospects and risk challenges is of great reference value for the development and governance of Artificial General Intelligence (AGI). Firstly, taking language models such as Multi-BERT (Multilingual Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer) and ChatGPT (Chat Generative Pre-trained Transformer) as representatives, the development history, core technologies and evaluation systems of LLMs were reviewed. Secondly, the existing technical limitations and security risks of LLMs were analyzed. Finally, suggestions were put forward for LLMs to improve technically and for policy to follow up. The analysis points out that, at their current stage of development, existing LLMs produce untruthful and biased outputs, lack real-time autonomous learning ability, demand enormous computing power, depend heavily on data quality and quantity, and show a monotonous language style; they also carry security risks concerning data privacy, information security and ethics. Future development can continue to improve the technology, moving from "large scale" to "lightweight", from "single modality" to "multi-modality", and from "general purpose" to "vertical domains"; on the policy side, regulation should follow up in real time with targeted measures to standardize the application and development of LLMs.
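As a concrete illustration of the "from large scale to lightweight" direction mentioned in the abstract (model compression via quantization and pruning is covered by references 51-53), the following is a minimal sketch of symmetric int8 post-training weight quantization written in NumPy. It is an illustrative toy example only, not the method of any model or reference cited in this paper, and the 4096 x 4096 weight matrix is hypothetical.

```python
# Toy illustration only: symmetric per-tensor int8 post-training quantization.
# Not the compression method of any model or reference cited in this paper.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 values plus one per-tensor scale."""
    scale = max(float(np.max(np.abs(w))) / 127.0, 1e-12)  # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights for inference."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # hypothetical weight matrix
q, s = quantize_int8(w)
err = float(np.mean(np.abs(dequantize(q, s) - w)))
print(f"fp32: {w.nbytes / 2**20:.0f} MiB, int8: {q.nbytes / 2**20:.0f} MiB, mean abs error: {err:.5f}")
```

Storing weights in int8 cuts memory by roughly a factor of four at the cost of a small reconstruction error, which is the basic trade-off behind the "lightweight" direction; production schemes (per-channel scales, quantization-aware training, structured pruning) are considerably more involved.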
Yuemei XU, Ling HU, Jiayi ZHAO, Wanze DU, Wenqing WANG. Technology application prospects and risk challenges of large language models [J]. Journal of Computer Applications, 2024, 44(6): 1655-1662.
Model | Release year | Maximum parameters | Training data | Model architecture
---|---|---|---|---
GPT | 2018 | ~120 million | BooksCorpus | 12-layer Transformer decoder
GPT-2 | 2019 | ~1.5 billion | WebText (~40 GB of text) | 48-layer Transformer decoder
GPT-3 | 2020 | ~175 billion | Common Crawl, WebText2, Books1, Books2 and Wikipedia (about 500 billion tokens in total) | 96-layer Transformer decoder
LaMDA | 2022 | ~137 billion | Public dialog data and web text, plus human-annotated data (768 billion tokens) | 64-layer Transformer decoder
InstructGPT | 2022 | ~175 billion | Tens of thousands of prompts with annotations of the generated outputs | GPT-3 + RLHF algorithm
PaLM | 2022 | ~540 billion | Webpages, books, Wikipedia, news, GitHub and social media conversations (780 billion tokens) | 118-layer Transformer decoder
GLM-130B | 2022 | ~130 billion | Over 400 billion tokens, roughly 200 billion each in English and Chinese | Transformer decoder
ChatGPT | 2022 | ~175 billion | Additional annotated data (details undisclosed) | GPT-3 + RLHF algorithm
LLaMA | 2023 | ~65 billion | CommonCrawl, C4, GitHub, Wikipedia, books, ArXiv and StackExchange (1.4 trillion tokens) | 80-layer Transformer decoder
GPT-4 | 2023 | — | Text and image data | Transformer decoder
ChatGLM-6B | 2023 | ~6.2 billion | About 1 trillion tokens, roughly 500 billion each in English and Chinese | Transformer decoder
PanGu-Σ | 2023 | ~1.085 trillion | WuDaoCorpora 2.0, Pile dataset, Python and Java code (over 300 billion tokens across 4 main domains) | 40-layer Transformer decoder
Vicuna | 2023 | ~13 billion | Fine-tuned from LLaMA-13B on supervised data (70 thousand user-shared ChatGPT conversations) | 40-layer Transformer decoder
PaLM 2 | 2023 | ~340 billion | Web documents, books, code, mathematics and conversational data, with a higher proportion of non-English data (3.6 trillion tokens) | Transformer decoder
Tab. 1 Model parameters of some representative LLMs
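A quick plausibility check on Table 1: for a decoder-only Transformer, the attention and feed-forward weights alone amount to roughly 12 × L × d² parameters for L layers of hidden width d (embeddings and biases ignored). The sketch below applies this estimate to GPT-2 and GPT-3; the hidden widths 1600 and 12288 are the publicly reported values and are assumptions of this sketch, not columns of the table.

```python
# Back-of-envelope parameter estimate for a decoder-only Transformer:
# each layer holds ~4*d^2 attention weights (Q, K, V and output projections)
# plus ~8*d^2 feed-forward weights (d -> 4d -> d), i.e. ~12*d^2 per layer.
def decoder_params(n_layers: int, d_model: int) -> int:
    attn = 4 * d_model * d_model
    ffn = 8 * d_model * d_model
    return n_layers * (attn + ffn)

# Hidden widths below are publicly reported figures, not values from Table 1.
for name, n_layers, d_model in [("GPT-2", 48, 1600), ("GPT-3", 96, 12288)]:
    print(f"{name}: ~{decoder_params(n_layers, d_model) / 1e9:.1f} billion parameters")
# Prints ~1.5 billion for GPT-2 and ~173.9 billion for GPT-3, close to the
# ~1.5 billion and ~175 billion listed in Table 1.
```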
Name | Evaluation framework | Link
---|---|---
MMLU | Contains 57 tasks, mainly measuring the multi-task accuracy of text models | github.com/hendrycks/test
SuperGLUE | Contains 8 language understanding tasks; the final model score is the weighted sum of the 8 task scores | https://super.gluebenchmark.com/
BIG-Bench | Contains 204 diverse tasks and 4 individual metrics, with different tasks using different metrics | github.com/google/BIG-bench
AGIEval | Evaluates the general ability of foundation models on tasks related to human cognition and problem solving; contains 20 admission and qualification exams designed for human examinees, such as the SAT and LSAT | github.com/microsoft/AGIEval
MMCU (Chinese) | Evaluates the multi-task accuracy of large Chinese language models in medicine, law, psychology and education | github.com/Felixgithub2017/MMCU
C-EVAL (Chinese) | Includes 13 948 multiple-choice questions spanning 52 disciplines and 4 difficulty levels | github.com/SJTU-LIT/ceval
MME | Contains 14 subtasks that evaluate the perception and cognition abilities of multimodal LLMs | github.com/bradyfu/awesome-multimodal-large-language-models
LVLM-eHub | Extensively evaluates 8 Large Vision-Language Models (LVLMs) on 6 categories of multimodal capability through 47 datasets and an online arena platform | github.com/opengvlab/multi-modality-arena
HELM | Adopts a multi-metric approach, evaluating 16 core scenarios with 7 metrics and assessing 30 mainstream LLMs on 42 scenarios | github.com/stanford-crfm/helm
INSTRUCTEVAL | Comprehensively evaluates instruction-tuned LLMs on problem solving, writing ability and alignment with human values | github.com/declare-lab/instruct-eval
G-Eval | Uses scores given by an LLM as the metric for evaluating the output quality of natural language generation tasks | github.com/nlpyang/geval
M3KE (Chinese) | Includes 20 477 questions collected from 71 tasks, covering all major levels of the Chinese education system | github.com/tjunlp-lab/m3ke
Tab. 2 Some representative evaluation benchmarks
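Most benchmarks in Table 2 collapse many per-task results into a single headline score. The sketch below shows the two simplest aggregation schemes implied by the table: a plain macro-average over tasks, and a weighted sum of task scores (the scheme Table 2 describes for SuperGLUE). All task names, scores and weights are hypothetical.

```python
# Hypothetical per-task accuracies and weights, for illustration only;
# real benchmarks define their own task sets, metrics and weights.
from statistics import mean

task_scores = {"task_a": 0.71, "task_b": 0.64, "task_c": 0.58}
task_weights = {"task_a": 0.5, "task_b": 0.3, "task_c": 0.2}   # weights sum to 1

macro_average = mean(task_scores.values())                      # equal weighting across tasks
weighted_sum = sum(task_scores[t] * task_weights[t] for t in task_scores)
print(f"macro-average: {macro_average:.3f}, weighted sum: {weighted_sum:.3f}")
```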
1 | WEI J, TAY Y, BOMMASANI R, et al. Emergent abilities of large language models [EB/OL]. [2023-03-10]. . |
2 | GOERTZEL B. Artificial general intelligence: concept, state of the art, and future prospects [J]. Journal of Artificial General Intelligence, 2014, 5(1): 1-46. |
3 | OpenAI. ChatGPT plugins [EB/OL]. [2023-05-05]. . |
4 | VAN DIS E A M, BOLLEN J, ZUIDEMA W, et al. ChatGPT: five priorities for research [J]. Nature, 2023, 614(7947): 224-226. |
5 | MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space [EB/OL]. [2023-02-23] . |
6 | PENNINGTON J, SOCHER R, MANNING C. GloVe: global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014: 1532-1543. |
7 | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017:6000-6010. |
8 | DEVLIN J, CHANG M-W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers). Stroudsburg: ACL, 2019: 4171-4186. |
9 | RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training [EB/OL]. [2023-05-30]. . |
10 | RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer [J]. The Journal of Machine Learning Research, 2020, 21(1):5485-5551. |
11 | YANG Z, DAI Z, YANG Y, et al. XLNet: generalized autoregressive pretraining for language understanding [C]// Proceedings of the 33rd Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 5753-5763. |
12 | LIU Y, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach [EB/OL]. [2023-02-23]. . |
13 | LAN Z, CHEN M, GOODMAN S, et al. ALBERT: a lite BERT for self-supervised learning of language representations [EB/OL]. [2023-05-30]. . |
14 | CLARK K, LUONG M-T, LE Q V, et al. ELECTRA: pre-training text encoders as discriminators rather than generators [EB/OL]. [2023-05-30]. . |
15 | RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners [EB/OL]. [2023-05-30]. . |
16 | BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners [C]// Proceedings of the 34th Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 1877-1901. |
17 | OUYANG L, WU J, JIANG X, et al. Training language models to follow instructions with human feedback [EB/OL]. [2023-02-23]. . |
18 | CHEN M, TWOREK J, JUN H, et al. Evaluating large language models trained on code [EB/OL]. [2023-02-23]. . |
19 | OpenAI. GPT-4 technical report [EB/OL]. [2023-06-07]. . |
20 | THOPPILAN R, DE FREITAS D, HALL J, et al. LaMDA: language models for dialog applications [EB/OL]. [2023-06-07]. . |
21 | CHOWDHERY A, NARANG S, DEVLIN J, et al. PaLM: scaling language modeling with pathways [EB/OL]. [2023-06-07]. . |
22 | ANIL R, DAI A M, FIRAT O, et al. PaLM 2 technical report [EB/OL]. [2023-06-07]. . |
23 | TOUVRON H, LAVRIL T, IZACARD G, et al. LLaMA: open and efficient foundation language models [EB/OL]. [2023-06-07]. . |
24 | The Vicuna Team. Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality [EB/OL]. [2023-06-07]. . |
25 | SMITH S, PATWARY M, NORICK B, et al. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model [EB/OL]. [2023-07-05]. . |
26 | ZENG W, REN X, SU T, et al. PanGu‑α: large-scale autoregressive pretrained Chinese language models with auto-parallel computation [EB/OL]. [2023-02-23]. . |
27 | REN X, ZHOU P, MENG X, et al. PanGu‑Σ: towards trillion parameter language model with sparse heterogeneous computing [EB/OL]. [2023-06-07]. . |
28 | DU Z, QIAN Y, LIU X, et al. GLM: general language model pretraining with autoregressive blank infilling [EB/OL]. [2023-07-05]. . |
29 | ZENG A, LIU X, DU Z, et al. GLM-130B: an open bilingual pre-trained model [EB/OL]. [2023-07-05]. . |
30 | XIONG H, WANG S, ZHU Y, et al. DoctorGLM: fine-tuning your Chinese doctor is not a Herculean task [EB/OL]. [2023-07-05]. . |
31 | STIENNON N, OUYANG L, WU J, et al. Learning to summarize with human feedback [C]// Proceedings of the 34th Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 3008-3021. |
32 | WU Z, HU Y, SHI W, et al. Fine-grained human feedback gives better rewards for language model training [EB/OL]. [2023-06-15]. . |
33 | DONG H, XIONG W, GOYAL D, et al. RAFT: reward ranked fine tuning for generative foundation model alignment [EB/OL]. [2023-06-14]. . |
34 | YUAN Z, YUAN H, TAN C, et al. RRHF: rank responses to align language models with human feedback without tears [EB/OL]. [2023-06-14]. . |
35 | RAFAILOV R, SHARMA A, MITCHELL E, et al. Direct preference optimization: your language model is secretly a reward model [EB/OL]. [2023-06-14]. . |
36 | HENDRYCKS D, BURNS C, BASART S, et al. Measuring massive multitask language understanding [C/OL]// Proceedings of the 9th International Conference on Learning Representations. 2021 [2023-05-30]. . |
37 | WANG A, PRUKSACHATKUN Y, NANGIA N, et al. SuperGLUE: a stickier benchmark for general-purpose language understanding systems [C]// Proceedings of the 33rd Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 3261-3275. |
38 | SRIVASTAVA A, RASTOGI A, RAO A, et al. Beyond the imitation game: quantifying and extrapolating the capabilities of language models [EB/OL]. [2023-02-25]. . |
39 | ZHONG W, CUI R, GUO Y, et al. AGIEval: a human-centric benchmark for evaluating foundation models [EB/OL]. [2023-06-27]. . |
40 | ZENG H. Measuring massive multitask Chinese understanding [EB/OL]. [2023-06-27]. . |
41 | HUANG Y, BAI Y, ZHU Z, et al. C-EVAL: a multi-level multi-discipline Chinese evaluation suite for foundation models [EB/OL]. [2023-06-27]. . |
42 | FU C, CHEN P, SHEN Y, et al. MME: a comprehensive evaluation benchmark for multimodal large language models [EB/OL]. [2023-06-27]. . |
43 | XU P, SHAO W, ZHANG K, et al. LVLM-eHub: a comprehensive evaluation benchmark for large vision-language models [EB/OL]. [2023-06-27]. . |
44 | LIANG P, BOMMASANI R, LEE T, et al. Holistic evaluation of language models [EB/OL]. [2023-06-08]. . |
45 | CHIA Y K, HONG P, BING L, et al. INSTRUCTEVAL: towards holistic evaluation of instruction-tuned large language models [EB/OL]. [2023-06-27]. . |
46 | LIU Y, ITER D, XU Y, et al. G-Eval: NLG evaluation using GPT-4 with better human alignment [EB/OL]. (2023-05-23)[2023-06-27]. . |
47 | LIU C, JIN R, REN Y, et al. M3KE: a massive multi-level multi-subject knowledge evaluation benchmark for Chinese large language models [EB/OL]. [2023-06-27]. . |
48 | ROZADO D. The political orientation of the ChatGPT AI system, 2022 [EB/OL]. [2023-03-09]. . |
49 | WEI J, WANG X, SCHUURMANS D, et al. Chain of thought prompting elicits reasoning in large language models [C/OL]//Proceedings of the 36th Conference on Neural Information Processing Systems. 2022[2023-05-30]. . |
50 | KAPLAN J, McCANDLISH S, HENIGHAN T, et al. Scaling laws for neural language models [EB/OL]. [2023-02-23]. . |
51 | TAO C, HOU L, ZHANG W, et al. Compression of generative pre-trained language models via quantization [C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2022: 4821-4836. |
52 | HE Y, ZHANG X, SUN J. Channel pruning for accelerating very deep neural networks[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 1398-1406. |
53 | WEN W, WU C, WANG Y, et al. Learning structured sparsity in deep neural networks [C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2016: 2082-2090. |
54 | HUANG S, DONG L, WANG W, et al. Language is not all you need: aligning perception with language models [EB/OL]. [2023-06-21]. . |
55 | SOLAIMAN I, BRUNDAGE M, CLARK J, et al. Release strategies and the social impacts of language models [EB/OL]. [2023-02-23]. . |
56 | CAO J F. Towards trustworthy AI: governance challenges and responses for generative AI like ChatGPT [J]. Journal of Shanghai University of Political Science and Law (The Rule of Law Forum), 2023, 38(4): 28-42. |
57 | ZHI Z F. Information content governance of large model of generative artificial intelligence [J]. Tribune of Political Science and Law, 2023, 41(4): 34-48. |