Review of optimization methods for end-to-end speech-to-speech translation

Journal of Computer Applications, 2025, Vol. 45, Issue 5: 1363-1371. DOI: 10.11772/j.issn.1001-9081.2024050666

Section: 10th China Conference on Data Mining

Wei ZONG¹,², Yue ZHAO¹,², Yin LI¹,², Xiaona XU¹,²

1. Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance, Ministry of Education (Minzu University of China), Beijing 100081, China
2. School of Information Engineering, Minzu University of China, Beijing 100081, China

Received: 2024-05-23 Revised: 2024-06-26 Accepted: 2024-06-26 Online: 2024-07-25 Published: 2025-05-10
Contact: Yue ZHAO
About the authors: ZONG Wei, born in 2002 in Yantai, Shandong, M. S. candidate, CCF member. His research interests include speech translation.
ZHAO Yue, born in 1974 in Fushun, Liaoning, Ph. D., professor. Her research interests include probabilistic graphical models, machine learning, and speech signal processing.
LI Yin, born in 2003 in Nanning, Guangxi. Her research interests include speech signal processing.
XU Xiaona, born in 1979 in Gongyi, Henan, Ph. D., lecturer. Her research interests include speech processing, image processing, and machine learning.
Supported by: National Natural Science Foundation of China (61976236)

Abstract

Speech-to-Speech Translation (S2ST) is an emerging research direction in the intelligent speech field that aims to accurately translate speech in one language into speech in another language. With the growing demand for cross-lingual communication, S2ST has attracted wide attention, and related research continues to emerge. Traditional cascaded models suffer from several problems in S2ST, including error propagation, inference latency, and the inability to translate languages without a writing system, so achieving direct S2ST with end-to-end models has become a key research focus. Based on a comprehensive survey of end-to-end S2ST, the various end-to-end S2ST models were analyzed and summarized in detail, the existing related technologies were reviewed, and the challenges facing end-to-end S2ST were grouped into three categories: modeling burden, data scarcity, and real-world application, with a focus on how existing work addresses each of them. The strong comprehension and generation capabilities of Large Language Models (LLMs) offer new possibilities for S2ST while also bringing additional challenges. Therefore, the application of LLMs in S2ST was discussed, and possible future development directions were envisioned.

Key words: end-to-end Speech-to-Speech Translation (S2ST), modeling burden, data scarcity, real-world application, speech foundation model

CLC number: TP391

Wei ZONG, Yue ZHAO, Yin LI, Xiaona XU. Review of optimization methods for end-to-end speech-to-speech translation[J]. Journal of Computer Applications, 2025, 45(5): 1363-1371.

Figures/Tables: 12

References (73)

1	LAVIE A， WAIBEL A， LEVIN L， et al. JANUS-Ⅲ： speech-to-speech translation in multiple languages［C］// Proceedings of the 1997 IEEE International Conference on Acoustics， Speech， and Signal Processing： Volume 1. Piscataway： IEEE， 1997： 99-102.
2	NAKAMURA S， MARKOV K， NAKAIWA H， et al. The ATR multilingual speech-to-speech translation system［J］. IEEE Transactions on Audio， Speech， and Language Processing， 2006， 14（2）： 365-376.
3	WANG C， WU Y， LIU S， et al. Bridging the gap between pre-training and fine-tuning for end-to-end speech translation［C］// Proceedings of the 34th AAAI Conference on Artificial Intelligence： AAAI-20 Technical Tracks 5/ AAAI Technical Track： Natural Language Processing. Palo Alto： AAAI Press， 2020， 34（5）： 9161-9168.
4	JIA Y， JOHNSON M， MACHEREY W， et al. Leveraging weakly supervised data to improve end-to-end speech-to-text translation［C］// Proceedings of the 2019 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2019： 7180-7184.
5	JIA Y， WEISS R J， BIADSY F， et al. Direct speech-to-speech translation with a sequence-to-sequence model［C］// Proceedings of the 20th Annual Conference of the International Speech Communication Association. ［S.l.］： ISCA， 2019： 1123-1127.
6	TJANDRA A， SAKTI S， NAKAMURA S. Speech-to-speech translation between untranscribed unknown languages［C］// Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop. Piscataway： IEEE， 2019： 593-600.
7	ZHANG C， TAN X， REN Y， et al. UWSpeech： speech to speech translation for unwritten languages［C］// Proceedings of the 35th AAAI Conference on Artificial Intelligence： AAAI-21 Technical Tracks 16/ AAAI Technical Track on Speech and Natural Language Processing Ⅲ. Palo Alto： AAAI Press， 2021， 35（16）： 14319-14327.
8	LEE A， CHEN P， WANG C， et al. Direct speech-to-speech translation with discrete units［C］// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics （Volume 1： Long Papers）. Stroudsburg： ACL， 2022： 3327-3339.
9	JIA Y， RAMANOVICH M T， REMEZ T， et al. Translatotron 2： high-quality direct speech-to-speech translation with voice preservation［C］// Proceedings of the 39th International Conference on Machine Learning. New York： ACM， 2022： 10120-10134.
10	LEE A， GONG H， DUQUENNE P A， et al. Textless speech-to-speech translation on real data［C］// Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics： Human Language Technologies. Stroudsburg： ACL， 2022： 860-872.
11	JIA Y， DING Y， BAPNA A， et al. Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation［C］// Proceedings of the 23rd Annual Conference of the International Speech Communication Association. ［S.l.］： ISCA， 2022： 1721-1725.
12	HUANG R， LIU J， LIU H， et al. TranSpeech： speech-to-speech translation with bilateral perturbation［EB/OL］. ［2023-10-13］. .
13	DONG Q， YUE F， KO T， et al. Leveraging pseudo-labeled data to improve direct speech-to-speech translation［C］// Proceedings of the 23rd Annual Conference of the International Speech Communication Association. ［S.l.］： ISCA， 2022： 1781-1785.
14	INAGUMA H， POPURI S， KULIKOV I， et al. UnitY： two-pass direct speech-to-speech translation with discrete units［C］// Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics （Volume 1： Long Papers）. Stroudsburg： ACL， 2023： 15655-15680.
15	CHEN P， TRAN K， YANG Y， et al. Speech-to-speech translation for a real-world unwritten language［C］// Findings of the Association for Computational Linguistics： ACL 2023. Stroudsburg： ACL， 2023： 4969-4983.
16	NACHMANI E， LEVKOVITCH A， DING Y， et al. Translatotron 3： speech to speech translation with monolingual data［C］// Proceedings of the 2024 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2024： 10686-10690.
17	DONG Q， HUANG Z， TIAN Q， et al. PolyVoice： language models for speech to speech translation［C/OL］// Proceedings of the Twelfth International Conference on Learning Representations. ［S.l.］： OpenReview.net， 2024 ［2024-12-30］. .
18	RUBENSTEIN P K， ASAWAROENGCHAI C， NGUYEN D D， et al. AudioPaLM： a large language model that can speak and listen［EB/OL］. ［2024-01-10］. .
19	GONG H Y， DONG N， POPURI S， et al. Multilingual speech-to-speech translation into multiple target languages［EB/OL］. ［2023-10-27］. .
20	Seamless Communication， BARRAULT L， CHUNG Y-A， et al. SeamlessM4T： massively multilingual & multimodal machine translation［EB/OL］. ［2023-11-27］. .
21	KIM S-B， LEE S-H， LEE S-W. TranSentence： speech-to-speech translation via language-agnostic sentence-level speech encoding without language-parallel data［C］// Proceedings of the 2024 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2024： 12722-12726.
22	KANO T， SAKTI S， NAKAMURA S. Transformer-based direct speech-to-speech translation with transcoder［C］// Proceedings of the 2021 IEEE Spoken Language Technology Workshop. Piscataway： IEEE， 2021： 958-965.
23	LIU R， ZHAO Y， XU X. Multi-task self-supervised learning based Tibetan-Chinese speech-to-speech translation［C］// Proceedings of the 2023 International Conference on Asian Language Processing. Piscataway： IEEE， 2023： 45-49.
24	LEWIS P M， SIMONS G F， FENNIG C D. Ethnologue global dataset［DB/OL］. ［2024-01-15］. .
25	ZHANG J， PAN J， YIN X， et al. Direct speech-to-speech translation without textual annotation using bottleneck features［EB/OL］. ［2023-12-07］. .
26	POLYAK A， ADI Y， COPET J， et al. Speech resynthesis from discrete disentangled self-supervised representations［C］// Proceedings of the 22nd Annual Conference of the International Speech Communication Association. ［S.l.］： ISCA， 2021： 3615-3619.
27	VAN DEN OORD A， VINYALS O， KAVUKCUOGLU K. Neural discrete representation learning［C］// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2017： 6309-6318.
28	HSU W N， BOLTE B， TSAI Y H H， et al. HuBERT： self-supervised speech representation learning by masked prediction of hidden units［J］. IEEE/ACM Transactions on Audio， Speech， and Language Processing， 2021， 29： 3451-3460.
29	ZHANG D， YE R， KO T， et al. DUB： discrete unit back translation for speech translation［C］// Findings of the Association for Computational Linguistics： ACL 2023. Stroudsburg： ACL， 2023： 7147-7164.
30	POST M， KUMAR G， LOPEZ A， et al. Fisher and CALLHOME Spanish-English speech translation［DS/OL］. Philadelphia： Linguistic Data Consortium， 2014 ［2024-10-04］. .
31	SHIMIZU H， NEUBIG G， SAKTI S， et al. Collection of a simultaneous translation corpus for comparative analysis［C］// Proceedings of the 9th International Conference on Language Resources and Evaluation. Paris： ELRA， 2014： 670-673.
32	BOITO M Z， HAVARD W N， GARNERIN M， et al. MaSS： a large and clean multilingual corpus of sentence-aligned spoken utterances extracted from the bible［C］// Proceedings of the 12th Language Resources and Evaluation Conference. Paris： ELRA， 2020： 6486-6493.
33	WANG C， RIVIERE M， LEE A， et al. VoxPopuli： a large-scale multilingual speech corpus for representation learning， semi-supervised learning and interpretation［C］// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing （Volume 1： Long Papers）. Stroudsburg： ACL， 2021： 993-1003.
34	JIA Y， RAMANOVICH M T， WANG Q， et al. CVSS corpus and massively multilingual speech-to-speech translation［C］// Proceedings of the 13th Language Resources and Evaluation Conference. Paris： ELRA， 2022： 6691-6703.
35	CONNEAU A， MA M， KHANUJA S， et al. Fleurs： few-shot learning evaluation of universal representations of speech［C］// Proceedings of the 2022 IEEE Spoken Language Technology Workshop. Piscataway： IEEE， 2023： 798-805.
36	DUQUENNE P A， GONG H， DONG N， et al. SpeechMatrix： a large-scale mined corpus of multilingual speech-to-speech translations［C］// Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics （Volume 1： Long Papers）. Stroudsburg： ACL， 2023： 16251-16269.
37	WANG C， INAGUMA H， CHEN P J， et al. Simple and effective unsupervised speech translation［C］// Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics （Volume 1： Long Papers）. Stroudsburg： ACL， 2023： 10771-10784.
38	NGUYEN X P， POPURI S， WANG C H， et al. Improving speech-to-speech translation through unlabeled text［C］// Proceedings of the 2023 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2023： 1-5.
39	POPURI S， CHEN P J， WANG C， et al. Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation［C］// Proceedings of the 23rd Annual Conference of the International Speech Communication Association. ［S.l.］： ISCA， 2022： 5195-5199.
40	BAEVSKI A， ZHOU H， MOHAMED A， et al. wav2vec 2.0： a framework for self-supervised learning of speech representations［C］// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2020： 12449-12460.
41	LIU Y， GU J， GOYAL N， et al. Multilingual denoising pre-training for neural machine translation［J］. Transactions of the Association for Computational Linguistics， 2020， 8： 726-742.
42	CHUNG Y A， ZHANG Y， HAN W， et al. w2v-BERT： combining contrastive learning and masked language modeling for self-supervised speech pre-training［C］// Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop. Piscataway： IEEE， 2021： 244-250.
43	BAPNA A， CHERRY C， ZHANG Y， et al. mSLAM： massively multilingual joint pre-training for speech and text［EB/OL］. ［2023-10-04］. .
44	LI X， JIA Y， CHIU C C. Textless direct speech-to-speech translation with discrete speech representation［C］// Proceedings of the 2023 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2023： 1-5.
45	WEI K， ZHOU L， ZHANG Z， et al. Joint pre-training with speech and bilingual text for direct speech to speech translation［C］// Proceedings of the 2023 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2023： 1-5.
46	DIWAN A， SRINIVASAN A， HARWATH D， et al. Textless speech-to-speech translation with limited parallel data ［C］// Findings of the Association for Computational Linguistics： EMNLP 2024. Stroudsburg： ACL， 2024： 16208-16224.
47	LAMPLE G， CONNEAU A， DENOYER L， et al. Unsupervised machine translation using monolingual corpora only［EB/OL］. ［2024-01-11］. .
48	SCHWENK H. Filtering and mining parallel data in a joint multilingual space［C］// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics （Volume 2： Short Papers）. Stroudsburg： ACL， 2018： 228-234.
49	DUQUENNE P A， GONG H Y， SCHWENK H. Multimodal and multilingual embeddings for large-scale speech mining［C］// Proceedings of the 35th International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2021： 15748-15761.
50	ARTETXE M， SCHWENK H. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond［J］. Transactions of the Association for Computational Linguistics， 2019， 7： 597-610.
51	BABU A， WANG C， TJANDRA A， et al. XLS-R： self-supervised cross-lingual speech representation learning at scale［C］// Proceedings of the 23rd Annual Conference of the International Speech Communication Association. ［S.l.］： ISCA， 2022： 2278-2282.
52	NLLB Team， COSTA-JUSSÀ M R， CROSS J， et al. No language left behind： scaling human-centered machine translation［EB/OL］. ［2023-12-13］. .
53	LEPIKHIN D， LEE H， XU Y， et al. GShard： scaling giant models with conditional computation and automatic sharding［EB/OL］. ［2024-03-04］. .
54	LEWIS M， BHOSALE S， DETTMERS T， et al. BASE layers： simplifying training of large， sparse models［C］// Proceedings of the 38th International Conference on Machine Learning. New York： ACM， 2021： 6265-6274.
55	CONNEAU A， LAMPLE G， RANZATO M， et al. Word translation without parallel data［EB/OL］. ［2024-01-18］. .
56	DURET J， ESTÈVE Y， PARCOLLET T. Learning multilingual expressive speech representation for prosody prediction without parallel data［C/OL］// Proceedings of the 12th Speech Synthesis Workshop. ［S.l.］： OpenReview.net， 2023 ［2024-01-13］. .
57	DURET J， O'BRIEN B， ESTÈVE Y， et al. Enhancing expressivity transfer in textless speech-to-speech translation［C］// Proceedings of the 2023 IEEE Automatic Speech Recognition and Understanding Workshop. Piscataway： IEEE， 2023： 1-8.
58	SONG K， REN Y， LEI Y， et al. StyleS2ST： zero-shot style transfer for direct speech-to-speech translation［C］// Proceedings of the 24th Annual Conference of the International Speech Communication Association. ［S.l.］： ISCA， 2023： 42-46.
59	WANG Y， BAI J， HUANG R， et al. Speech-to-speech translation with discrete-unit-based style transfer［C］// Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics （Volume 4： Student Research Workshop）. Stroudsburg： ACL， 2024： 34-41.
60	ZEGHIDOUR N， LUEBS A， OMRAN A， et al. SoundStream： an end-to-end neural audio codec［J］. IEEE/ACM Transactions on Audio， Speech， and Language Processing， 2022， 30： 495-507.
61	MA X， PINO J M， CROSS J， et al. Monotonic multihead attention［C/OL］// Proceedings of the 8th International Conference on Learning Representations. ［S.l.］： OpenReview.net， 2020 ［2023-12-15］. .
62	RAFFEL C， LUONG M T， LIU P J， et al. Online and linear-time attention by enforcing monotonic alignments［C］// Proceedings of the 34th International Conference on Machine Learning. New York： ACM， 2017： 2837-2846.
63	MA X， GONG H， LIU D， et al. Direct simultaneous speech-to-speech translation with variational monotonic multihead attention［EB/OL］. ［2023-12-15］. .
64	KOJIMA T， GU S S， REID M， et al. Large language models are zero-shot reasoners［C］// Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2022： 22199-22213.
65	BORSOS Z， MARINIER R， VINCENT D， et al. AudioLM： a language modeling approach to audio generation［J］. IEEE/ACM Transactions on Audio， Speech， and Language Processing， 2022， 31： 2523-2533.
66	BORSOS Z， SHARIFI M， VINCENT D， et al. SoundStorm： efficient parallel audio generation［C/OL］// Proceedings of the 12th International Conference on Learning Representations. ［S.l.］： OpenReview.net， 2024 ［2024-03-21］. .
67	ZHU X F， LV Y J， LEI Y， et al. Vec-Tok Speech： speech vectorization and tokenization for neural speech generation［EB/OL］. ［2023-12-23］. .
68	RAMESH A， PAVLOV M， GOH G， et al. Zero-shot text-to-image generation［C］// Proceedings of the 38th International Conference on Machine Learning. New York： ACM， 2021： 8821-8831.
69	ROMBACH R， BLATTMANN A， LORENZ D， et al. High-resolution image synthesis with latent diffusion models［C］// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2022： 10674-10685.
70	HUANG R， HUANG J， YANG D， et al. Make-an-audio： text-to-audio generation with prompt-enhanced diffusion models［C］// Proceedings of the 40th International Conference on Machine Learning. New York： ACM， 2023： 13916-13932.
71	GUPTA A， YU L， SOHN K， et al. Photorealistic video generation with diffusion models［C］// Proceedings of the 2024 European Conference on Computer Vision， LNCS 15137. Cham： Springer， 2025： 393-411.
72	BLATTMANN A， DOCKHORN T， KULAL S， et al. Stable video diffusion： scaling latent video diffusion models to large datasets［EB/OL］. ［2024-03-07］. .
73	HUANG R， LIU H， CHENG X， et al. AV-TranSpeech： audio-visual robust speech-to-speech translation［C］// Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics （Volume 1： Long Papers）. Stroudsburg： ACL， 2023： 8590-8604.

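Year	Main research work
2019	Google proposed Translatotron, the first end-to-end S2ST model, verifying the feasibility of the end-to-end approach [5]
2019	Tjandra et al. [6] introduced VQ-VAE, achieving S2ST without transcribed text for the first time
2021	Zhang et al. [7] proposed UWSpeech, which adopts the idea of VQ-VAE to translate languages without a writing system
2021	Meta proposed the concept of discrete units for training the S2UT model, laying the foundation for subsequent work [8]
2021	Google proposed Translatotron 2, which uses two-pass decoding and multi-task learning for training [9]
2021	Meta proposed the Textless model, trained on real data based on the S2UT model, to translate languages without a writing system [10]
2022	Google improved Translatotron 2 with joint speech-text pre-training [11]
2022	Huang et al. [12] proposed TranSpeech, which uses bilateral perturbation and non-autoregressive decoding
2022	ByteDance improved Translatotron based on Transformer and proposed a pseudo translation labeling method [13]
2022	Meta improved Translatotron 2 and proposed UnitY, which decomposes S2UT and uses two-pass decoding [14]
2022	Meta, building on S2UT and UnitY and using Chinese as a bridge, achieved translation between Hokkien and English [15]
2023	Google proposed Translatotron 3, the first to achieve unsupervised S2ST [16]
2023	ByteDance proposed PolyVoice, which uses two language models for S2U and U2S conversion [17]
2023	Google proposed AudioPaLM, which uses a text language model to predict speech [18]
2023	Meta achieved S2ST from a source language into multiple target languages [19]
2023	Meta proposed SeamlessM4T, supporting massively multilingual and multimodal machine translation [20]
2024	Kim et al. [21] proposed TranSentence, which uses language-agnostic sentence-level speech encoding to achieve unsupervised S2ST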

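Dataset	Number of languages	Total duration/h	Source of target speech
Fisher [30]	2 (Spanish-English)	127	TTS synthesis
STC [31]	2 (English-Japanese)	31	Simultaneous interpretation
MaSS [32]	8 (56 directions)	150	Manually constructed
VoxPopuli [33]	15 (210 directions)	17 300	Simultaneous interpretation
CVSS (C+T) [34]	X-En (21 directions)	3 800	TTS synthesis
FLEURS [35]	102	1 400	Manually constructed
SpeechMatrix [36]	17 (272 directions)	418 000	Speech data mining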
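
The direction counts listed for MaSS, VoxPopuli, and SpeechMatrix match the number of ordered (source, target) language pairs, n(n-1), among n languages. The short Python sketch below checks that arithmetic. The function name `directed_pairs` is hypothetical, and the reading that every ordered pair is counted is an inference from the numbers in the table, not a statement taken from the dataset papers.

```python
def directed_pairs(n_languages: int) -> int:
    """Number of ordered (source, target) pairs among n distinct languages."""
    return n_languages * (n_languages - 1)


# (number of languages, number of directions) as listed in the table above
listed = {
    "MaSS": (8, 56),
    "VoxPopuli": (15, 210),
    "SpeechMatrix": (17, 272),
}

for name, (n, directions) in listed.items():
    # Compare the n(n-1) count with the direction count given in the table
    print(name, directed_pairs(n), directed_pairs(n) == directions)
```

CVSS is a different case: it is listed as X-En, so its 21 directions all share English as the target and the pair formula does not apply.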

Model	ASR-BLEU, CVSS-C Es-En	ASR-BLEU, Fisher Es-En
Translatotron [5]	14.1 [14]	25.6 [5]
+Transformer	25.1 [14]	32.0 [13]
+PTL [13]	-	43.6 [13]
Translatotron 2 [9]	30.1 [14]	37.0 [13]
+Transformer Decoder	30.8 [14]	-
+ST pre-training	33.4 [14]	-
+w2v-BERT	35.9 [14]	-
+mSLAM	36.8 [14]	-
+TTS data augmentation	37.1 [14]	-
UWSpeech [7]	-	9.4 [13]
S2UT [8]	29.0 [14]	39.9 [8]
+ST pre-training	30.5 [14]	-
+w2v-BERT+u-mBART	34.8 [14]	-
UnitY [14]	32.3 [14]	51.4 [14]
+ST pre-training	33.4 [14]	-
+w2v-BERT+t-mBART	37.2 [14]	-
Translatotron 3 [16]	14.25 [16]	-
TranSentence [21]	18.24 [21]	-
ST+TTS (cascaded)	33.3 [14]	45.1 [13]
Real data (reference)	88.6 [14]	89.8 [13]
