Journal of Computer Applications, 2025, Vol. 45, Issue (10): 3277-3283. DOI: 10.11772/j.issn.1001-9081.2024091244

• Multimedia computing and computer simulation •

Flow-based lightweight high-quality text-to-speech conversion method

Lianqing WEN1, Ye TAO1, Yunlong TIAN2, Li NIU2, Hongxia SUN2

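  1. School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, Shandong 266061
  2. Digital Home Network National Engineering Laboratory, Qingdao, Shandong 266000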
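  • Received: 2024-09-05  Revised: 2024-10-23  Accepted: 2024-10-24  Online: 2024-11-05  Published: 2025-10-10
  • Corresponding author: Ye TAO
  • About the authors: WEN Lianqing, born in 2000, male, from Harbin, Heilongjiang, M.S. candidate. His main research interests include speech synthesis and signal processing.
    TAO Ye, born in 1981, male, from Qingdao, Shandong, professor, Ph.D. His main research interests include software engineering, artificial intelligence, and human-computer interaction. Email: ye.tao@qust.edu.cn
    TIAN Yunlong, born in 1981, male, from Qingdao, Shandong, senior engineer, M.S. His main research interests include smart home, digital transformation of the brain, and scenario-based applications of next-generation artificial intelligence.
    NIU Li, born in 1983, female, from Yanjin, Henan, senior engineer, M.S. Her main research interests include intelligent home appliance scene generation, speech recognition, and Internet of Things communication.
    SUN Hongxia, born in 1987, female, from Qingdao, Shandong, engineer. Her main research interests include Internet of Things communication.
  • Funding:
    National Key Research and Development Program of China (2023YFF0612100); Qingdao Key Technology Research and Industrialization Demonstration Project (24-1-2-qljh-19-gx)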

Flow-based lightweight high-quality text-to-speech conversion method

Lianqing WEN1, Ye TAO1, Yunlong TIAN2, Li NIU2, Hongxia SUN2

  1. School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, Shandong 266061, China
  2. Digital Home Network National Engineering Laboratory, Qingdao, Shandong 266000, China
  • Received: 2024-09-05  Revised: 2024-10-23  Accepted: 2024-10-24  Online: 2024-11-05  Published: 2025-10-10
  • Contact: Ye TAO
  • About the authors: WEN Lianqing, born in 2000, M.S. candidate. His research interests include speech synthesis and signal processing.
    TAO Ye, born in 1981, Ph.D., professor. His research interests include software engineering, artificial intelligence, and human-computer interaction.
    TIAN Yunlong, born in 1981, M.S., senior engineer. His research interests include smart home, digital transformation of the brain, and scenario-based applications of next-generation artificial intelligence.
    NIU Li, born in 1983, M.S., senior engineer. Her research interests include intelligent home appliance scene generation, speech recognition, and Internet of Things communication.
    SUN Hongxia, born in 1987, engineer. Her research interests include Internet of Things communication.

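Abstract:

The development of non-autoregressive text-to-speech (NAR-TTS) models has made fast, high-quality speech synthesis possible. However, the prosody of synthesized speech still needs improvement, and the one-to-many mapping between text units and speech makes it difficult to generate high-quality Mel spectra with rich prosody. In addition, existing NAR-TTS models contain a large number of redundant neural networks. Therefore, a flow-based, lightweight, high-quality NAR-TTS method named AirSpeech is proposed. First, the text is analyzed to obtain speech feature encodings at different granularities. Second, attention-based techniques are used to align these feature encodings and extract prosodic information from the mixed encoding; in this process, the long-short range attention (LSRA) mechanism and single-network technology are used to make feature extraction lightweight. Finally, a flow-based decoder is designed, which significantly reduces the model's parameter count and peak memory, and affine coupling layers (ACL) are introduced so that the decoded Mel spectra are more detailed and natural. Experimental results show that, compared with BVAE-TTS and PortaSpeech, AirSpeech achieves better structural similarity (SSIM) and mean opinion score (MOS), balancing high-quality synthesized speech with a lightweight model.

Key words: speech synthesis, multi-granularity feature extraction, rich prosody, flow-based speech decoder, affine coupling layer, lightweight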

Abstract:

The development of Non-AutoRegressive Text-To-Speech (NAR-TTS) models has made it possible to synthesize high-quality speech rapidly. However, the prosody of the synthesized speech still needs improvement, and the one-to-many mapping between text units and speech makes it difficult to generate high-quality Mel spectra with rich prosody. Additionally, existing NAR-TTS models contain many redundant neural networks. To address these issues, a high-quality, lightweight, flow-based NAR-TTS method named AirSpeech was proposed. Firstly, the text was analyzed to obtain speech feature encodings at different granularities. Secondly, attention-based techniques were used to align these feature encodings, thereby extracting prosodic information from the mixed encoding. In this process, the Long-Short Range Attention (LSRA) mechanism and single-network technology were used to make feature extraction lightweight. Finally, a flow-based decoder was designed, which significantly reduced the model's parameter count and peak memory, and Affine Coupling Layers (ACL) were introduced so that the decoded Mel spectra were more detailed and natural. Experimental results indicate that AirSpeech outperforms the BVAE-TTS and PortaSpeech methods in terms of Structural SIMilarity (SSIM) and Mean Opinion Score (MOS), balancing high-quality synthesized speech with a lightweight model.

Key words: speech synthesis, multi-granularity feature extraction, rich prosody, flow-based speech decoder, Affine Coupling Layer (ACL), lightweight
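
The abstract names the affine coupling layer as a building block of the flow-based decoder but does not describe its internals. As a rough illustration only, the sketch below shows a generic affine coupling layer of the kind used in normalizing flows, written in plain Python. It is not AirSpeech's decoder: the function names and the `toy_scale_shift` conditioner are hypothetical stand-ins for a learned network. The point it makes is that the layer is exactly invertible and has a cheap log-determinant, which is what lets a flow map between Mel-spectrum frames and a simple latent distribution.

```python
import math


def toy_scale_shift(x_a):
    """Hypothetical conditioner: stands in for a learned neural network.

    Returns a log-scale and a shift for each element of the transformed half.
    """
    log_s = [math.tanh(v) for v in x_a]  # bounded log-scale for stability
    t = [0.5 * v for v in x_a]           # arbitrary shift
    return log_s, t


def affine_coupling_forward(x, scale_shift_fn):
    """Map x -> y and return (y, log|det J|). Requires an even-length input."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    # The scale and shift depend only on the half that passes through unchanged.
    log_s, t = scale_shift_fn(x_a)
    y_b = [xb * math.exp(ls) + tt for xb, ls, tt in zip(x_b, log_s, t)]
    # The Jacobian is triangular, so its log-determinant is the sum of log-scales.
    return x_a + y_b, sum(log_s)


def affine_coupling_inverse(y, scale_shift_fn):
    """Exactly invert affine_coupling_forward."""
    if len(y) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    # y_a equals x_a, so the same scale and shift can be recomputed.
    log_s, t = scale_shift_fn(y_a)
    x_b = [(yb - tt) * math.exp(-ls) for yb, ls, tt in zip(y_b, log_s, t)]
    return y_a + x_b


if __name__ == "__main__":
    x = [0.3, -1.2, 0.8, 2.0]
    y, log_det = affine_coupling_forward(x, toy_scale_shift)
    print("y =", y)
    print("log|det J| =", log_det)
    print("reconstructed x =", affine_coupling_inverse(y, toy_scale_shift))
```

Because the conditioner is never inverted, it can be an arbitrarily expressive network in a real model; stacking several such layers, with the roles of the two halves alternating, is the usual way to transform every dimension.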

CLC number: