Journal of Computer Applications


Meta-learning adaptation for few-shot text-to-speech

  

  • Received:2023-05-23 Revised:2023-09-07 Online:2023-09-26 Published:2023-09-26
  • Supported by:
    National Natural Science Foundation of China; Shanghai Science and Technology Program "Distributed and generative few-shot algorithm and theory research"; Shanghai Science and Technology Program "Federated based cross-domain and cross-task incremental learning"

Few-shot text-to-speech based on meta-learning adaptation

WU Zhihao1, CHI Ziqiu2, XIAO Ting2, WANG Zhe1

  1. School of Information Science and Engineering, East China University of Science and Technology
     2. East China University of Science and Technology
  • Corresponding author: WANG Zhe
  • Supported by:
    National Natural Science Foundation of China; Shanghai Science and Technology Program "Research on distributed few-shot learning and few-shot generation algorithms and theory"; Shanghai Science and Technology Program "Research on cross-domain/cross-task incremental learning under a federated framework"

Abstract: Few-shot text-to-speech aims to synthesize speech that closely resembles the original speaker from only a small amount of training data. However, this setting faces two challenges: adapting quickly to new speakers, and improving the similarity of the generated speech while preserving speech quality. Existing models rarely account for how model features change across different adaptation stages, so speech similarity improves slowly. To address these issues, a meta-learning-guided model for adapting to new speakers is proposed. A meta-feature module guides the adaptation process, ensuring that speech similarity improves while the quality of the generated speech is maintained during adaptation to a new speaker; in addition, a step encoder distinguishes the different adaptation stages, increasing the speed at which the model adapts to new speakers. The proposed method was evaluated on the Libri-TTS and VCTK datasets with both subjective and objective metrics. The experimental results show Mel Cepstral Distortion (MCD) of 7.45 and 6.52 on the two datasets, respectively; the method surpasses other meta-learning methods in synthesized-speech similarity and adapts to new speakers faster.
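For reference, the Mel Cepstral Distortion reported above is conventionally computed per frame as MCD = (10 / ln 10) · sqrt(2 · Σ_d (c_d − ĉ_d)²) over aligned mel-cepstral coefficients, then averaged over frames. The following is a minimal NumPy sketch of that convention; the function name, array shapes, and the assumption that frames are already time-aligned and the energy coefficient already dropped are illustrative, not details taken from the paper:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """Frame-averaged MCD in dB between two time-aligned mel-cepstrum
    sequences of shape (frames, dims). The 0th (energy) coefficient is
    usually excluded beforehand, and alignment (e.g. by DTW) is assumed
    to have been done already."""
    diff = np.asarray(mcep_ref) - np.asarray(mcep_syn)
    # 10 / ln(10) * sqrt(2 * sum of squared coefficient differences), per frame
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

Identical sequences give an MCD of 0; lower values indicate synthesized speech closer to the reference recording.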

Key words: Few-shot generation, Text-to-speech (TTS), Meta-learning, Speaker adaptation, Feature extraction

Abstract: Few-shot text-to-speech requires synthesizing speech similar to the original speaker's from only a small number of samples. Existing few-shot text-to-speech faces the following problems: how to adapt quickly to a new speaker, and how to improve the similarity between the generated speech and the speaker while guaranteeing speech quality. When adapting to a new speaker, existing models rarely consider how model features change across different adaptation stages, so the generated speech cannot rapidly gain similarity while maintaining quality. To solve these problems, a method that uses meta-learning to guide model adaptation to new speakers is proposed: a meta-feature module guides the adaptation process, ensuring that speech similarity improves while generated-speech quality is maintained during adaptation to a new speaker, and a step encoder distinguishes the different adaptation stages, increasing the speed at which the model adapts to new speakers. Experiments on the Libri-TTS and VCTK datasets combined subjective and objective metrics and compared the method against existing fast speaker-adaptation methods at different numbers of adaptation steps. The results show Mel Cepstral Distortion (MCD) of 7.45 and 6.52, respectively; the method outperforms other meta-learning methods in synthesized-speech similarity and adapts to new speakers faster.

Key words: Few-shot generation, Text-to-speech, Meta-learning, Speaker adaptation, Feature extraction
