Journal of Computer Applications (《计算机应用》), 2025, Vol. 45, Issue 5: 1363-1371. DOI: 10.11772/j.issn.1001-9081.2024050666

• The 10th China Conference on Data Mining •

Review of optimization methods for end-to-end speech-to-speech translation

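Wei ZONG1,2, Yue ZHAO1,2, Yin LI1,2, Xiaona XU1,2

  1. Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance, Ministry of Education (Minzu University of China), Beijing 100081, China
  2. School of Information Engineering, Minzu University of China, Beijing 100081, China
  • Received: 2024-05-23 Revised: 2024-06-26 Accepted: 2024-06-26 Online: 2024-07-25 Published: 2025-05-10
  • Corresponding author: Yue ZHAO
  • About the authors: ZONG Wei, born in 2002 in Yantai, Shandong, male, M.S. candidate, CCF member. His main research interest is speech translation.
    ZHAO Yue, born in 1974 in Fushun, Liaoning, female, Ph.D., professor. Her main research interests are probabilistic graphical models, machine learning, and speech signal processing.
    LI Yin, born in 2003 in Nanning, Guangxi, female. Her main research interest is speech signal processing.
    XU Xiaona, born in 1979 in Gongyi, Henan, female, Ph.D., lecturer. Her main research interests are speech processing, image processing, and machine learning.
  • Funding:
    National Natural Science Foundation of China (61976236)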

Review of optimization methods for end-to-end speech-to-speech translation

Wei ZONG1,2, Yue ZHAO1,2, Yin LI1,2, Xiaona XU1,2

  1. Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance, Ministry of Education (Minzu University of China), Beijing 100081, China
  2. School of Information Engineering, Minzu University of China, Beijing 100081, China
  • Received: 2024-05-23 Revised: 2024-06-26 Accepted: 2024-06-26 Online: 2024-07-25 Published: 2025-05-10
  • Contact: Yue ZHAO
  • About author: ZONG Wei, born in 2002, M.S. candidate. His research interests include speech translation.
    ZHAO Yue, born in 1974, Ph.D., professor. Her research interests include probabilistic graphical models, machine learning, and speech signal processing.
    LI Yin, born in 2003. Her research interests include speech signal processing.
    XU Xiaona, born in 1979, Ph.D., lecturer. Her research interests include speech processing, image processing, and machine learning.
  • Supported by:
    National Natural Science Foundation of China (61976236)

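Abstract:

Speech-to-Speech Translation (S2ST) is an emerging research direction in the field of intelligent speech that aims to accurately translate speech in one language into speech in another language. As demand for cross-lingual communication grows, S2ST has attracted wide attention, and related studies continue to emerge. Traditional cascaded models suffer from several problems in S2ST, such as error propagation, inference latency, and the inability to translate languages that have no writing system, so achieving direct S2ST with end-to-end models has become the focus of current research. Based on a comprehensive survey of end-to-end S2ST, the various end-to-end S2ST models were analyzed and categorized in detail, the existing related techniques were reviewed, and the challenges facing end-to-end S2ST were summarized as three classes of problems: modeling burden, data scarcity, and real-world application. How existing work addresses these three classes of problems was examined in particular. The strong comprehension and generation capabilities of Large Language Models (LLMs) open new possibilities for S2ST while also bringing more challenges. Therefore, the application of LLMs in S2ST was discussed, and possible future directions were envisioned.

Keywords: end-to-end speech-to-speech translation, modeling burden, data scarcity, real-world application, speech foundation model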

Abstract:

Speech-to-Speech Translation (S2ST) is an emerging research direction in the field of intelligent speech, aiming to seamlessly translate speech in one language into speech in another language. With increasing demand for cross-lingual communication, S2ST has garnered significant attention, and related research continues to emerge. Traditional cascaded models face numerous problems in S2ST, including error propagation, inference latency, and the inability to translate languages without a writing system. To address these issues, achieving direct S2ST with end-to-end models has become a key research focus. Based on a comprehensive survey of end-to-end S2ST, the various end-to-end S2ST models were analyzed and summarized in detail, the existing related technologies were reviewed, and the challenges facing end-to-end S2ST were grouped into three categories: modeling burden, data scarcity, and real-world application, with a focus on how existing work has addressed each of them. The strong comprehension and generation capabilities of Large Language Models (LLMs) offer new possibilities for S2ST while also presenting additional challenges. Therefore, the application of LLMs in S2ST was discussed, and potential future development directions were outlined.

Key words: end-to-end Speech-to-Speech Translation (S2ST), modeling burden, data scarcity, real-world application, speech foundation model

CLC number: