Speaker-emotion voice conversion method with limited corpus based on large language model and pre-trained model
Chaofeng LU, Ye TAO, Lianqing WEN, Fei MENG, Xiugong QIN, Yongjie DU, Yunlong TIAN
Journal of Computer Applications    2025, 45 (3): 815-822.   DOI: 10.11772/j.issn.1001-9081.2024010013

To address the facts that speaker conversion and emotional voice conversion are rarely studied jointly, and that the emotional corpus of a target speaker in real-world scenarios is usually too small to train a well-generalizing model from scratch, a Speaker-Emotion Voice Conversion with Limited corpus (LSEVC) method was proposed, fusing a large language model with a pre-trained emotional speech synthesis model. Firstly, a large language model was used to generate text with the required emotion tags. Secondly, a pre-trained emotional speech synthesis model was fine-tuned on the target speaker's corpus to embed the target speaker. Thirdly, emotional speech was synthesized from the generated text for data augmentation. Fourthly, the synthesized speech and the original target speech were used to co-train the speaker-emotion voice conversion model. Finally, to further improve the speaker similarity and emotional similarity of the converted speech, the model was fine-tuned on the original emotional speech of the target speaker. Experiments were conducted on publicly available corpora and a Chinese fiction corpus. Experimental results show that the proposed method outperforms CycleGAN-EVC, Seq2Seq-EVC-WA2, SMAL-ET2 and other methods on the evaluation metrics of Emotional similarity Mean Opinion Score (EMOS), Speaker similarity Mean Opinion Score (SMOS), Mel Cepstral Distortion (MCD), and Word Error Rate (WER).
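As an illustration only, the five steps above can be read as a data-augmentation-then-fine-tuning pipeline. The Python outline below is a minimal sketch of that flow under stated assumptions; every name in it (generate_emotion_texts, the llm and tts objects, conversion_model, the emotion tag set) is hypothetical and does not come from the paper or any specific library.

```python
# Hypothetical sketch of the described LSEVC pipeline. All object interfaces
# (llm.complete, tts.finetune/synthesize, conversion_model.fit/finetune) are
# illustrative placeholders, not APIs from the paper or an existing library.

EMOTIONS = ["neutral", "happy", "sad", "angry", "surprise"]  # assumed tag set


def generate_emotion_texts(llm, emotion: str, n: int) -> list[str]:
    """Step 1: prompt a large language model for sentences with an emotion tag."""
    prompt = f"Write {n} short sentences that clearly express a {emotion} mood."
    return llm.complete(prompt).splitlines()[:n]


def build_augmented_corpus(llm, tts, target_corpus):
    """Steps 2-3: fine-tune the pre-trained emotional TTS on the target
    speaker's small corpus, then synthesize emotional speech from LLM text."""
    tts.finetune(target_corpus)                      # embed the target speaker
    augmented = []
    for emotion in EMOTIONS:
        for text in generate_emotion_texts(llm, emotion, n=50):
            augmented.append(tts.synthesize(text, emotion=emotion))
    return augmented


def train_lsevc(conversion_model, augmented, target_corpus):
    """Steps 4-5: co-train on synthetic plus real target speech, then
    fine-tune on the target speaker's own emotional recordings."""
    conversion_model.fit(augmented + list(target_corpus))
    conversion_model.finetune(target_corpus)
    return conversion_model
```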

Flow-based lightweight high-quality text-to-speech conversion method
Lianqing WEN, Ye TAO, Yunlong TIAN, Li NIU, Hongxia SUN
Journal of Computer Applications    2025, 45 (10): 3277-3283.   DOI: 10.11772/j.issn.1001-9081.2024091244

The development of Non-AutoRegressive Text-To-Speech (NAR-TTS) models has made it possible to synthesize high-quality speech rapidly. However, the prosody of the synthesized speech still needs improvement, and the one-to-many mapping between text units and speech makes it difficult to generate Mel spectra with rich prosody and high quality. In addition, the neural networks in existing NAR-TTS models are redundant. To address these issues, a flow-based, lightweight, high-quality NAR-TTS method named AirSpeech was proposed. Firstly, the text was analyzed to obtain speech feature encodings of different granularities. Secondly, attention mechanism-based techniques were used to align these feature encodings and extract prosodic information from the mixed encoding; in this process, a Long-Short Range Attention (LSRA) mechanism and single-network technology were used to keep feature extraction lightweight. Finally, a flow-based decoder was designed, which reduced the model's parameters and peak memory significantly, and by introducing the Affine Coupling Layer (ACL), the decoded Mel spectra were made more detailed and natural. Experimental results show that AirSpeech outperforms methods such as BVAE-TTS and PortaSpeech in terms of Structural SIMilarity (SSIM) and Mean Opinion Score (MOS), achieving a balance between the high quality of the synthesized speech and the lightweight nature of the model.
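The flow-based decoder in this abstract centres on Affine Coupling Layers. The PyTorch sketch below is a generic, textbook-style ACL showing the split / scale-shift / log-determinant mechanics that make such decoders invertible; the channel split, hidden size, and convolutional predictor are assumptions, not AirSpeech's actual design.

```python
# Minimal, generic affine coupling layer (ACL) sketch in PyTorch.
# Layer sizes and the conv-net predictor are assumptions, not AirSpeech's.
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    def __init__(self, channels: int, hidden: int = 192):
        super().__init__()
        half = channels // 2
        # A small conv net predicts per-channel log-scale and shift from one half.
        self.net = nn.Sequential(
            nn.Conv1d(half, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 2 * half, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, channels, frames), e.g. a Mel-spectrogram latent.
        xa, xb = x.chunk(2, dim=1)                 # split along channels
        log_s, t = self.net(xa).chunk(2, dim=1)    # predict scale and shift
        yb = xb * torch.exp(log_s) + t             # affine-transform one half
        logdet = log_s.sum(dim=(1, 2))             # log|det J| for the flow loss
        return torch.cat([xa, yb], dim=1), logdet

    def inverse(self, y: torch.Tensor):
        # Exact inverse used at synthesis time: undo the affine transform.
        ya, yb = y.chunk(2, dim=1)
        log_s, t = self.net(ya).chunk(2, dim=1)
        xb = (yb - t) * torch.exp(-log_s)
        return torch.cat([ya, xb], dim=1)
```

Stacking several such blocks, with channel permutations in between, yields an invertible decoder that can be trained with an exact likelihood objective and inverted exactly at synthesis time.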
