The development of Non-AutoRegressive Text-To-Speech (NAR-TTS) models has made it possible to synthesize high-quality speech rapidly. However, the prosody of the synthesized speech still needs improvement, and the one-to-many mapping between text units and speech makes it difficult to generate Mel spectrograms with rich prosody and high quality. In addition, existing NAR-TTS models contain redundant neural network components. To address these issues, a high-quality, lightweight flow-based NAR-TTS method named AirSpeech was proposed. Firstly, the input text was analyzed to obtain speech feature encodings at different granularities. Secondly, attention-based techniques were used to align these feature encodings and extract prosodic information from the mixed encoding; in this process, the Long-Short Range Attention (LSRA) mechanism and a single shared network were used to keep feature extraction lightweight. Finally, a flow-based decoder was designed, which significantly reduced the model's parameters and peak memory, and the introduction of the Affine Coupling Layer (ACL) made the decoded Mel spectrograms more detailed and natural. Experimental results show that AirSpeech outperforms the BVAE-TTS and PortaSpeech methods in terms of Structural SIMilarity (SSIM) and Mean Opinion Score (MOS), achieving a balance between high-quality speech synthesis and a lightweight model.
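To illustrate the Affine Coupling Layer (ACL) mentioned above, the following is a minimal NumPy sketch of the technique, not the paper's actual implementation: the input is split in half, and the second half is scaled and shifted by quantities predicted from the first half. The tiny linear "conditioner" network here is a hypothetical stand-in for whatever network AirSpeech uses; the key property shown is that the layer is exactly invertible, which is what lets a flow-based decoder map between Mel spectrograms and a simple latent distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineCoupling:
    """Toy affine coupling layer: y1 = x1, y2 = x2 * exp(s(x1)) + t(x1)."""

    def __init__(self, dim):
        half = dim // 2
        # Hypothetical conditioner: a single linear map producing the
        # log-scale and shift; a real model would use a deeper network.
        self.W = rng.normal(scale=0.1, size=(half, 2 * half))
        self.b = np.zeros(2 * half)

    def _scale_shift(self, x1):
        h = x1 @ self.W + self.b
        log_s, t = np.split(h, 2, axis=-1)
        return np.tanh(log_s), t  # tanh bounds the scale for stability

    def forward(self, x):
        x1, x2 = np.split(x, 2, axis=-1)
        log_s, t = self._scale_shift(x1)
        y2 = x2 * np.exp(log_s) + t  # affine transform of the second half
        return np.concatenate([x1, y2], axis=-1)

    def inverse(self, y):
        y1, y2 = np.split(y, 2, axis=-1)
        # The conditioner sees only the untouched half, so inversion is exact.
        log_s, t = self._scale_shift(y1)
        x2 = (y2 - t) * np.exp(-log_s)
        return np.concatenate([y1, x2], axis=-1)

layer = AffineCoupling(dim=8)
x = rng.normal(size=(4, 8))
x_rec = layer.inverse(layer.forward(x))
print(np.allclose(x, x_rec))  # prints True: the layer is exactly invertible
```

Because only half of the input is transformed at each layer, coupling layers are cheap to invert and their Jacobian determinant is trivial to compute, which is why they are a common building block in lightweight flow-based decoders.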