Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (5): 1363-1371. DOI: 10.11772/j.issn.1001-9081.2024050666
• China Conference on Data Mining 2024 (CCDM 2024) •
Review of optimization methods for end-to-end speech-to-speech translation
Wei ZONG 1,2, Yue ZHAO 1,2, Yin LI 1,2, Xiaona XU 1,2
Received: 2024-05-23
Revised: 2024-06-26
Accepted: 2024-06-26
Online: 2024-07-25
Published: 2025-05-10
Contact: Yue ZHAO
About author: ZONG Wei, born in 2002, M.S. candidate, CCF member. His research interests include speech translation.
Wei ZONG, Yue ZHAO, Yin LI, Xiaona XU. Review of optimization methods for end-to-end speech-to-speech translation[J]. Journal of Computer Applications, 2025, 45(5): 1363-1371.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024050666
| Year | Major research work |
|---|---|
| 2019 | Google proposed Translatotron, the first end-to-end S2ST model, demonstrating the feasibility of the end-to-end approach [5] |
| | Tjandra et al. [6] achieved speech-to-speech translation between untranscribed unknown languages |
| 2021 | Zhang et al. [7] proposed UWSpeech for speech-to-speech translation of unwritten languages |
| | Meta introduced the concept of discrete units for training S2UT models, laying the foundation for subsequent work [8] |
| | Google proposed Translatotron 2, which adopts two-pass decoding and multi-task learning during training [9] |
| | Meta proposed the Textless model, trained on real data on top of S2UT, to translate languages without writing systems [10] |
| 2022 | Google improved Translatotron 2 with joint speech-text pre-training [11] |
| | Huang et al. [12] proposed TranSpeech, a speech-to-speech translation model with bilateral perturbation |
| | ByteDance improved Translatotron with a Transformer backbone and proposed the pseudo translation label (PTL) method [13] |
| | Meta improved Translatotron 2 and proposed UnitY, which decomposes S2UT and adopts two-pass decoding [14] |
| | Meta, building on S2UT and UnitY, used Mandarin as a bridge to translate between Hokkien and English [15] |
| 2023 | Google proposed Translatotron 3, achieving unsupervised S2ST for the first time [16] |
| | ByteDance proposed PolyVoice, which uses two language models for speech-to-unit (S2U) and unit-to-speech (U2S) conversion [17] |
| | Google proposed AudioPaLM, which leverages a text language model to predict speech [18] |
| | Meta achieved S2ST from a source language into multiple target languages [19] |
| | Meta proposed SeamlessM4T, supporting massively multilingual and multimodal machine translation [20] |
| 2024 | Kim et al. [21] proposed TranSentence, which performs S2ST via language-agnostic sentence-level speech encoding without language-parallel data |

Tab. 1 Major research efforts in development of S2ST
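Several entries in Table 1 hinge on the discrete-unit idea behind S2UT [8]: continuous speech is mapped to a short sequence of learned unit IDs that a translation model can then predict like text. The sketch below illustrates only that discretization step, not the full pipeline of [8]: torchaudio's bundled English HUBERT_BASE stands in for the multilingual HuBERT checkpoints used in the literature, and the feature layer and unit count (k = 100) are assumptions following common practice [28].

```python
# Minimal sketch of speech-to-unit discretization in the spirit of S2UT [8,28]:
# HuBERT frame features are clustered with k-means, each frame is replaced by
# its cluster ID, and runs of identical IDs are collapsed into a unit sequence.
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE   # stand-in for multilingual HuBERT
model = bundle.get_model().eval()

def hubert_features(wav_path: str, layer: int = 6) -> torch.Tensor:
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    with torch.no_grad():
        feats, _ = model.extract_features(waveform)  # one tensor per layer
    return feats[layer - 1].squeeze(0)               # (frames, dim)

def train_quantizer(train_paths: list[str], n_units: int = 100) -> KMeans:
    feats = torch.cat([hubert_features(p) for p in train_paths]).numpy()
    return KMeans(n_clusters=n_units, n_init=10).fit(feats)

def to_units(wav_path: str, km: KMeans) -> list[int]:
    ids = km.predict(hubert_features(wav_path).numpy())
    # Collapse consecutive duplicates, as in reduced-unit S2UT targets.
    return [int(u) for i, u in enumerate(ids) if i == 0 or u != ids[i - 1]]
```

The resulting unit sequences serve as translation targets for the S2UT encoder-decoder, and a separately trained unit vocoder [26] maps predicted units back to a waveform.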
| Dataset | Languages | Total duration/h | Target speech source |
|---|---|---|---|
| Fisher [30] | 2 (Spanish-English) | 127 | TTS synthesis |
| STC [31] | 2 (English-Japanese) | 31 | Simultaneous interpretation |
| MaSS [32] | 8 (56 directions) | 150 | Manually constructed |
| VoxPopuli [33] | 15 (210 directions) | 17 300 | Simultaneous interpretation |
| CVSS (C+T) [34] | X-En (21 directions) | 3 800 | TTS synthesis |
| FLEURS [35] | 102 | 1 400 | Manually constructed |
| SpeechMatrix [36] | 17 (272 directions) | 418 000 | Speech data mining |

Tab. 2 Common open-source datasets
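The "speech data mining" behind SpeechMatrix [36] in Table 2 embeds utterances from two languages into one semantic space [49] and keeps pairs whose cosine similarity stands out against each side's nearest neighbours, the margin criterion of [48,50]. Below is a toy sketch of that margin-based pairing over precomputed, L2-normalized embeddings; the random vectors, the mine_pairs name, and the threshold value are illustrative assumptions, not values from the cited papers.

```python
# Toy sketch of margin-based mining [48,50] as used to build SpeechMatrix [36]:
# a candidate pair is kept when its cosine similarity is high relative to the
# average similarity of each side's k nearest neighbours (ratio margin).
import numpy as np

def mine_pairs(src: np.ndarray, tgt: np.ndarray, k: int = 4,
               threshold: float = 1.06) -> list[tuple[int, int, float]]:
    """src: (n, d), tgt: (m, d) L2-normalized utterance embeddings."""
    sim = src @ tgt.T                                         # cosine matrix
    knn_src = np.mean(np.sort(sim, axis=1)[:, -k:], axis=1)   # (n,) row kNN mean
    knn_tgt = np.mean(np.sort(sim, axis=0)[-k:, :], axis=0)   # (m,) col kNN mean
    margin = 2 * sim / (knn_src[:, None] + knn_tgt[None, :])  # ratio margin
    pairs = []
    for i in range(sim.shape[0]):
        j = int(np.argmax(margin[i]))                         # best target match
        if margin[i, j] >= threshold:
            pairs.append((i, j, float(margin[i, j])))
    return pairs

# Random embeddings stand in for a multilingual speech encoder's output [49].
rng = np.random.default_rng(0)
src = rng.normal(size=(5, 16)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(6, 16)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
print(mine_pairs(src, tgt))
```

At SpeechMatrix scale the same idea runs over billions of segments with approximate nearest-neighbour search rather than a dense similarity matrix.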
| Model | ASR-BLEU on CVSS-C Es-En | ASR-BLEU on Fisher Es-En |
|---|---|---|
| Translatotron [5] | 14.1 | 25.6 |
| +Transformer | 25.1 | 32.0 |
| +PTL [13] | | 43.6 |
| Translatotron 2 [9] | 30.1 | 37.0 |
| +Transformer decoder | 30.8 | |
| +ST pre-training | 33.4 | |
| +w2v-BERT | 35.9 | |
| +mSLAM | 36.8 | |
| +TTS data augmentation | 37.1 | |
| UWSpeech [7] | | 9.4 |
| S2UT [8] | 29.0 | 39.9 |
| +ST pre-training | 30.5 | |
| +w2v-BERT+u-mBART | 34.8 | |
| UnitY [14] | 32.3 | 51.4 |
| +ST pre-training | 33.4 | |
| +w2v-BERT+t-mBART | 37.2 | |
| Translatotron 3 [16] | 14.25 | |
| TranSentence [21] | 18.24 | |
| ST+TTS (cascaded) | 33.3 | 45.1 |
| Ground-truth reference | 88.6 | 89.8 |

Tab. 3 ASR-BLEU scores of different models on CVSS-C Es-En and Fisher Es-En datasets
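The ASR-BLEU metric in Table 3 is computed by transcribing the synthesized target speech with an ASR system and scoring the transcripts against the reference translations with BLEU. A minimal sketch follows; it assumes openai-whisper as the ASR model and sacrebleu as the scorer, whereas the cited papers use fixed, task-specific ASR checkpoints and text normalization, so absolute scores from this sketch will not match the table.

```python
# Minimal ASR-BLEU sketch: ASR over generated target speech, then corpus BLEU
# against reference translations. Whisper and lowercasing are stand-ins for
# the evaluation ASR models and normalization used in the cited papers.
import whisper
import sacrebleu

asr = whisper.load_model("base")

def asr_bleu(wav_paths: list[str], references: list[str]) -> float:
    hyps = [asr.transcribe(p, language="en")["text"].strip().lower()
            for p in wav_paths]
    refs = [r.strip().lower() for r in references]
    return sacrebleu.corpus_bleu(hyps, [refs]).score

# Example usage with placeholder paths to a system's generated waveforms:
# print(asr_bleu(["out/utt1.wav", "out/utt2.wav"],
#                ["hello world", "good morning"]))
```

Because the metric passes through an ASR system, it penalizes unintelligible synthesis as well as mistranslation, which is why even ground-truth target speech scores below 100 in Table 3.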
[1] LAVIE A, WAIBEL A, LEVIN L, et al. JANUS-III: speech-to-speech translation in multiple languages[C]// Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, Volume 1. Piscataway: IEEE, 1997: 99-102.
[2] NAKAMURA S, MARKOV K, NAKAIWA H, et al. The ATR multilingual speech-to-speech translation system[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(2): 365-376.
[3] WANG C, WU Y, LIU S, et al. Bridging the gap between pre-training and fine-tuning for end-to-end speech translation[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 9161-9168.
[4] JIA Y, JOHNSON M, MACHEREY W, et al. Leveraging weakly supervised data to improve end-to-end speech-to-text translation[C]// Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2019: 7180-7184.
[5] JIA Y, WEISS R J, BIADSY F, et al. Direct speech-to-speech translation with a sequence-to-sequence model[C]// Proceedings of the 20th Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2019: 1123-1127.
[6] TJANDRA A, SAKTI S, NAKAMURA S. Speech-to-speech translation between untranscribed unknown languages[C]// Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop. Piscataway: IEEE, 2019: 593-600.
[7] ZHANG C, TAN X, REN Y, et al. UWSpeech: speech to speech translation for unwritten languages[C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 14319-14327.
[8] LEE A, CHEN P, WANG C, et al. Direct speech-to-speech translation with discrete units[C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2022: 3327-3339.
[9] JIA Y, RAMANOVICH M T, REMEZ T, et al. Translatotron 2: high-quality direct speech-to-speech translation with voice preservation[C]// Proceedings of the 39th International Conference on Machine Learning. New York: ACM, 2022: 10120-10134.
[10] LEE A, GONG H, DUQUENNE P A, et al. Textless speech-to-speech translation on real data[C]// Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: ACL, 2022: 860-872.
[11] JIA Y, DING Y, BAPNA A, et al. Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation[C]// Proceedings of the 23rd Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2022: 1721-1725.
[12] HUANG R, LIU J, LIU H, et al. TranSpeech: speech-to-speech translation with bilateral perturbation[EB/OL]. [2023-10-13].
[13] DONG Q, YUE F, KO T, et al. Leveraging pseudo-labeled data to improve direct speech-to-speech translation[C]// Proceedings of the 23rd Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2022: 1781-1785.
[14] INAGUMA H, POPURI S, KULIKOV I, et al. UnitY: two-pass direct speech-to-speech translation with discrete units[C]// Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2023: 15655-15680.
[15] CHEN P, TRAN K, YANG Y, et al. Speech-to-speech translation for a real-world unwritten language[C]// Findings of the Association for Computational Linguistics: ACL 2023. Stroudsburg: ACL, 2023: 4969-4983.
[16] NACHMANI E, LEVKOVITCH A, DING Y, et al. Translatotron 3: speech to speech translation with monolingual data[C]// Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2024: 10686-10690.
[17] DONG Q, HUANG Z, TIAN Q, et al. PolyVoice: language models for speech to speech translation[C/OL]// Proceedings of the 12th International Conference on Learning Representations. [S.l.]: OpenReview.net, 2024 [2024-12-30].
[18] RUBENSTEIN P K, ASAWAROENGCHAI C, NGUYEN D D, et al. AudioPaLM: a large language model that can speak and listen[EB/OL]. [2024-01-10].
[19] GONG H Y, DONG N, POPURI S, et al. Multilingual speech-to-speech translation into multiple target languages[EB/OL]. [2023-10-27].
[20] Seamless Communication, BARRAULT L, CHUNG Y A, et al. SeamlessM4T: massively multilingual & multimodal machine translation[EB/OL]. [2023-11-27].
[21] KIM S B, LEE S H, LEE S W. TranSentence: speech-to-speech translation via language-agnostic sentence-level speech encoding without language-parallel data[C]// Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2024: 12722-12726.
[22] KANO T, SAKTI S, NAKAMURA S. Transformer-based direct speech-to-speech translation with transcoder[C]// Proceedings of the 2021 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2021: 958-965.
[23] LIU R, ZHAO Y, XU X. Multi-task self-supervised learning based Tibetan-Chinese speech-to-speech translation[C]// Proceedings of the 2023 International Conference on Asian Language Processing. Piscataway: IEEE, 2023: 45-49.
[24] LEWIS P M, SIMONS G F, FENNIG C D. Ethnologue global dataset[DB/OL]. [2024-01-15].
[25] ZHANG J, PAN J, YIN X, et al. Direct speech-to-speech translation without textual annotation using bottleneck features[EB/OL]. [2023-12-07].
[26] POLYAK A, ADI Y, COPET J, et al. Speech resynthesis from discrete disentangled self-supervised representations[C]// Proceedings of the 22nd Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2021: 3615-3619.
[27] VAN DEN OORD A, VINYALS O, KAVUKCUOGLU K. Neural discrete representation learning[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6309-6318.
[28] HSU W N, BOLTE B, TSAI Y H H, et al. HuBERT: self-supervised speech representation learning by masked prediction of hidden units[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3451-3460.
[29] ZHANG D, YE R, KO T, et al. DUB: discrete unit back translation for speech translation[C]// Findings of the Association for Computational Linguistics: ACL 2023. Stroudsburg: ACL, 2023: 7147-7164.
[30] POST M, KUMAR G, LOPEZ A, et al. Fisher and CALLHOME Spanish-English speech translation[DS/OL]. Philadelphia: Linguistic Data Consortium, 2014 [2024-10-04].
[31] SHIMIZU H, NEUBIG G, SAKTI S, et al. Collection of a simultaneous translation corpus for comparative analysis[C]// Proceedings of the 9th International Conference on Language Resources and Evaluation. Paris: ELRA, 2014: 670-673.
[32] BOITO M Z, HAVARD W N, GARNERIN M, et al. MaSS: a large and clean multilingual corpus of sentence-aligned spoken utterances extracted from the Bible[C]// Proceedings of the 12th Language Resources and Evaluation Conference. Paris: ELRA, 2020: 6486-6493.
[33] WANG C, RIVIERE M, LEE A, et al. VoxPopuli: a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Stroudsburg: ACL, 2021: 993-1003.
[34] JIA Y, RAMANOVICH M T, WANG Q, et al. CVSS corpus and massively multilingual speech-to-speech translation[C]// Proceedings of the 13th Language Resources and Evaluation Conference. Paris: ELRA, 2022: 6691-6703.
[35] CONNEAU A, MA M, KHANUJA S, et al. FLEURS: few-shot learning evaluation of universal representations of speech[C]// Proceedings of the 2022 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2023: 798-805.
[36] DUQUENNE P A, GONG H, DONG N, et al. SpeechMatrix: a large-scale mined corpus of multilingual speech-to-speech translations[C]// Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2023: 16251-16269.
[37] WANG C, INAGUMA H, CHEN P J, et al. Simple and effective unsupervised speech translation[C]// Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2023: 10771-10784.
[38] NGUYEN X P, POPURI S, WANG C H, et al. Improving speech-to-speech translation through unlabeled text[C]// Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2023: 1-5.
[39] POPURI S, CHEN P J, WANG C, et al. Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation[C]// Proceedings of the 23rd Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2022: 5195-5199.
[40] BAEVSKI A, ZHOU H, MOHAMED A, et al. wav2vec 2.0: a framework for self-supervised learning of speech representations[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 12449-12460.
[41] LIU Y, GU J, GOYAL N, et al. Multilingual denoising pre-training for neural machine translation[J]. Transactions of the Association for Computational Linguistics, 2020, 8: 726-742.
[42] CHUNG Y A, ZHANG Y, HAN W, et al. w2v-BERT: combining contrastive learning and masked language modeling for self-supervised speech pre-training[C]// Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop. Piscataway: IEEE, 2021: 244-250.
[43] BAPNA A, CHERRY C, ZHANG Y, et al. mSLAM: massively multilingual joint pre-training for speech and text[EB/OL]. [2023-10-04].
[44] LI X, JIA Y, CHIU C C. Textless direct speech-to-speech translation with discrete speech representation[C]// Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2023: 1-5.
[45] WEI K, ZHOU L, ZHANG Z, et al. Joint pre-training with speech and bilingual text for direct speech to speech translation[C]// Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2023: 1-5.
[46] DIWAN A, SRINIVASAN A, HARWATH D, et al. Textless speech-to-speech translation with limited parallel data[C]// Findings of the Association for Computational Linguistics: EMNLP 2024. Stroudsburg: ACL, 2024: 16208-16224.
[47] LAMPLE G, CONNEAU A, DENOYER L, et al. Unsupervised machine translation using monolingual corpora only[EB/OL]. [2024-01-11].
[48] SCHWENK H. Filtering and mining parallel data in a joint multilingual space[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Stroudsburg: ACL, 2018: 228-234.
[49] DUQUENNE P A, GONG H Y, SCHWENK H. Multimodal and multilingual embeddings for large-scale speech mining[C]// Proceedings of the 35th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2021: 15748-15761.
[50] ARTETXE M, SCHWENK H. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond[J]. Transactions of the Association for Computational Linguistics, 2019, 7: 597-610.
[51] BABU A, WANG C, TJANDRA A, et al. XLS-R: self-supervised cross-lingual speech representation learning at scale[C]// Proceedings of the 23rd Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2022: 2278-2282.
[52] NLLB Team, COSTA-JUSSÀ M R, CROSS J, et al. No language left behind: scaling human-centered machine translation[EB/OL]. [2023-12-13].
[53] LEPIKHIN D, LEE H, XU Y, et al. GShard: scaling giant models with conditional computation and automatic sharding[EB/OL]. [2024-03-04].
[54] LEWIS M, BHOSALE S, DETTMERS T, et al. BASE layers: simplifying training of large, sparse models[C]// Proceedings of the 38th International Conference on Machine Learning. New York: ACM, 2021: 6265-6274.
[55] CONNEAU A, LAMPLE G, RANZATO M, et al. Word translation without parallel data[EB/OL]. [2024-01-18].
[56] DURET J, ESTÈVE Y, PARCOLLET T. Learning multilingual expressive speech representation for prosody prediction without parallel data[C/OL]// Proceedings of the 12th Speech Synthesis Workshop. [S.l.]: OpenReview.net, 2023 [2024-01-13].
[57] DURET J, O'BRIEN B, ESTÈVE Y, et al. Enhancing expressivity transfer in textless speech-to-speech translation[C]// Proceedings of the 2023 IEEE Automatic Speech Recognition and Understanding Workshop. Piscataway: IEEE, 2023: 1-8.
[58] SONG K, REN Y, LEI Y, et al. StyleS2ST: zero-shot style transfer for direct speech-to-speech translation[C]// Proceedings of the 24th Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2023: 42-46.
[59] WANG Y, BAI J, HUANG R, et al. Speech-to-speech translation with discrete-unit-based style transfer[C]// Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). Stroudsburg: ACL, 2024: 34-41.
[60] ZEGHIDOUR N, LUEBS A, OMRAN A, et al. SoundStream: an end-to-end neural audio codec[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 495-507.
[61] MA X, PINO J M, CROSS J, et al. Monotonic multihead attention[C/OL]// Proceedings of the 8th International Conference on Learning Representations. [S.l.]: OpenReview.net, 2020 [2023-12-15].
[62] RAFFEL C, LUONG M T, LIU P J, et al. Online and linear-time attention by enforcing monotonic alignments[C]// Proceedings of the 34th International Conference on Machine Learning. New York: ACM, 2017: 2837-2846.
[63] MA X, GONG H, LIU D, et al. Direct simultaneous speech-to-speech translation with variational monotonic multihead attention[EB/OL]. [2023-12-15].
[64] KOJIMA T, GU S S, REID M, et al. Large language models are zero-shot reasoners[C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2022: 22199-22213.
[65] BORSOS Z, MARINIER R, VINCENT D, et al. AudioLM: a language modeling approach to audio generation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 2523-2533.
[66] BORSOS Z, SHARIFI M, VINCENT D, et al. SoundStorm: efficient parallel audio generation[C/OL]// Proceedings of the 12th International Conference on Learning Representations. [S.l.]: OpenReview.net, 2024 [2024-03-21].
[67] ZHU X F, LV Y J, LEI Y, et al. Vec-Tok Speech: speech vectorization and tokenization for neural speech generation[EB/OL]. [2023-12-23].
[68] RAMESH A, PAVLOV M, GOH G, et al. Zero-shot text-to-image generation[C]// Proceedings of the 38th International Conference on Machine Learning. New York: ACM, 2021: 8821-8831.
[69] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 10674-10685.
[70] HUANG R, HUANG J, YANG D, et al. Make-An-Audio: text-to-audio generation with prompt-enhanced diffusion models[C]// Proceedings of the 40th International Conference on Machine Learning. New York: ACM, 2023: 13916-13932.
[71] GUPTA A, YU L, SOHN K, et al. Photorealistic video generation with diffusion models[C]// Proceedings of the 2024 European Conference on Computer Vision, LNCS 15137. Cham: Springer, 2025: 393-411.
[72] BLATTMANN A, DOCKHORN T, KULAL S, et al. Stable Video Diffusion: scaling latent video diffusion models to large datasets[EB/OL]. [2024-03-07].
[73] HUANG R, LIU H, CHENG X, et al. AV-TranSpeech: audio-visual robust speech-to-speech translation[C]// Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2023: 8590-8604.