Meta-learning adaption for few-shot text-to-speech

doi:10.11772/j.issn.1001-9081.2023050640

Journal of Computer Applications, 2024, Vol. 44, Issue (5): 1629-1635. DOI: 10.11772/j.issn.1001-9081.2023050640

Special topic: Multimedia computing and computer simulation

Zhihao WU, Ziqiu CHI, Ting XIAO, Zhe WANG

School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China

Received: 2023-05-24 Revised: 2023-09-07 Accepted: 2023-09-13 Online: 2023-09-26 Published: 2024-05-10
Contact: Zhe WANG
About the authors:
WU Zhihao, born in 1999 in Cixi, Zhejiang, M. S. candidate. His research interests include deep learning and few-shot learning.
CHI Ziqiu, born in 1995 in Hegang, Heilongjiang, Ph. D. His research interests include deep learning and few-shot learning.
XIAO Ting, born in 1990 in Suining, Sichuan, Ph. D., lecturer. Her research interests include machine learning and medical image processing.
WANG Zhe (first contact), born in 1981, Ph. D., professor, CCF member. His research interests include pattern recognition, machine learning, data mining and big data analysis, and image processing and analysis.
Supported by:
Shanghai Science and Technology Program (21511100800); National Natural Science Foundation of China (62076094)

Abstract:

Few-shot Text-To-Speech (TTS) aims to synthesize speech that closely resembles the original speaker from only a few samples of that speaker. Existing few-shot TTS methods face two problems: how to adapt quickly to a new speaker, and how to improve the similarity between the generated speech and the speaker while preserving speech quality. When adapting to a new speaker, existing models rarely consider how model features change across adaptation stages, so the generated speech cannot gain similarity quickly without a loss of quality. To address these problems, a method that uses meta-learning to guide adaptation to new speakers is proposed. A meta-feature module guides the adaptation process so that speech similarity improves while the quality of the generated speech is maintained, and a step encoder distinguishes the different adaptation stages to speed up adaptation to new speakers. The proposed method was compared with existing fast speaker-adaptation methods on the Libri-TTS and VCTK datasets, at different numbers of adaptation steps, using subjective and objective evaluation metrics. Experimental results show that the Dynamic Time Warping-Mel Cepstral Distortion (DTW-MCD) of the proposed method is 7.4502 on Libri-TTS and 6.5243 on VCTK. The method surpasses other meta-learning methods in synthesized speech similarity and adapts to new speakers faster.

Key words: few-shot generation, Text-To-Speech (TTS), meta-learning, speaker adaptation, feature extraction
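The abstract reports DTW-MCD without spelling out how it is computed. The sketch below shows one common way to obtain such a score, and it is not taken from the paper: each utterance is assumed to be a list of mel-cepstral coefficient vectors with the 0th (energy) coefficient already removed, frames are aligned with plain dynamic time warping, and the distortion is averaged over the alignment path. The names `frame_mcd` and `dtw_mcd` are hypothetical, and the authors' exact feature extraction and normalization may differ.

```python
import math

# Standard MCD scaling constant: (10 / ln 10) * sqrt(2), giving a value in dB.
MCD_CONST = 10.0 / math.log(10.0) * math.sqrt(2.0)


def frame_mcd(a, b):
    """Mel-cepstral distortion (dB) between two coefficient vectors of equal length."""
    return MCD_CONST * math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_mcd(ref, syn):
    """Average frame-level MCD along the cheapest DTW alignment of two utterances.

    Assumption: the score is normalized by the length of the alignment path;
    other conventions (e.g. dividing by the reference length) also exist.
    """
    n, m = len(ref), len(syn)
    if n == 0 or m == 0:
        raise ValueError("both utterances need at least one frame")
    inf = float("inf")
    # cost[i][j]: minimal accumulated distortion aligning ref[:i] with syn[:j]
    # steps[i][j]: number of frame pairs on that cheapest path
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    steps = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_mcd(ref[i - 1], syn[j - 1])
            prev_cost, prev_steps = min(
                (cost[i - 1][j], steps[i - 1][j]),          # ref frame advances
                (cost[i][j - 1], steps[i][j - 1]),          # syn frame advances
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),  # both advance
            )
            cost[i][j] = prev_cost + d
            steps[i][j] = prev_steps + 1
    return cost[n][m] / steps[n][m]


if __name__ == "__main__":
    # Toy two-dimensional "cepstra"; real features would come from a vocoder analysis step.
    reference = [[1.0, 2.0], [3.0, 4.0]]
    stretched = [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]]
    print(dtw_mcd(reference, stretched))
```

Because the alignment absorbs differences in speaking rate, a synthesized utterance that merely repeats a frame scores the same as the reference itself. Lower values indicate spectra closer to the target speaker's recording.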

CLC number: TP391

Cite this article: Zhihao WU, Ziqiu CHI, Ting XIAO, Zhe WANG. Meta-learning adaption for few-shot text-to-speech[J]. Journal of Computer Applications, 2024, 44(5): 1629-1635.

Figures and tables (7)

Fig. 1 Overall framework

Fig. 2 Meta-feature extraction module

Fig. 3 Speaker similarity evaluation results with different feature extraction modules on the LibriTTS-test-clean and VCTK datasets

Tab. 1 Speaker similarity evaluation results with and without the step encoder

Step	With step encoder, LibriTTS-test-clean	With step encoder, VCTK	Without step encoder, LibriTTS-test-clean	Without step encoder, VCTK
1	0.6530	0.5355	0.6161	0.5898
5	0.7860	0.6583	0.6874	0.6180
10	0.8222	0.6891	0.7147	0.6148
20	0.8508	0.7070	0.7385	0.6277
50	0.8672	0.7156	0.7765	0.6452
100	0.8724	0.7132	0.7988	0.6649
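One way to read Table 1 is to ask how many adaptation steps the model with the step encoder needs to match the similarity that the model without it reaches after 100 steps. The sketch below does that arithmetic on the tabulated values only. The helper `first_step_reaching` is a hypothetical name, and the comparison is limited to the six step counts that appear in the table.

```python
# Speaker similarity after each number of adaptation steps, copied from Table 1.
STEPS = [1, 5, 10, 20, 50, 100]
SIMILARITY = {
    ("with", "LibriTTS-test-clean"): [0.6530, 0.7860, 0.8222, 0.8508, 0.8672, 0.8724],
    ("with", "VCTK"): [0.5355, 0.6583, 0.6891, 0.7070, 0.7156, 0.7132],
    ("without", "LibriTTS-test-clean"): [0.6161, 0.6874, 0.7147, 0.7385, 0.7765, 0.7988],
    ("without", "VCTK"): [0.5898, 0.6180, 0.6148, 0.6277, 0.6452, 0.6649],
}


def first_step_reaching(scores, target):
    """Return the smallest tabulated step whose score is at least `target`, else None."""
    for step, score in zip(STEPS, scores):
        if score >= target:
            return step
    return None


for dataset in ("LibriTTS-test-clean", "VCTK"):
    target = SIMILARITY[("without", dataset)][-1]  # 100-step score without the encoder
    step = first_step_reaching(SIMILARITY[("with", dataset)], target)
    print(f"{dataset}: with the step encoder, {target:.4f} is first reached at step {step}")
```

On both datasets the tabulated threshold is first met at step 10. The table also shows one exception to the overall trend: at step 1 on VCTK, the variant without the step encoder scores higher (0.5898 against 0.5355).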

Fig. 4 Speaker similarity evaluation results with different fine-tuning methods on the LibriTTS-test-clean and VCTK datasets

Tab. 2 DTW-MCD evaluation results after 100 adaptation steps

Model	LibriTTS-test-clean	VCTK
Baseline	7.6772±0.9385	7.0229±1.0263
Meta-TTS	7.5714±0.9218	6.8025±1.0035
Meta-adaption	7.4502±0.8611	6.5243±0.9053
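To put the DTW-MCD figures of Table 2 on a common scale, the mean values can be turned into relative reductions. The snippet below is a small worked calculation on the tabulated means only. The function name `relative_reduction` is hypothetical, and the ± spreads in the table are not used, so the result says nothing about statistical significance.

```python
# Mean DTW-MCD after 100 adaptation steps, copied from Table 2 (lower is better).
DTW_MCD = {
    "Baseline": {"LibriTTS-test-clean": 7.6772, "VCTK": 7.0229},
    "Meta-TTS": {"LibriTTS-test-clean": 7.5714, "VCTK": 6.8025},
    "Meta-adaption": {"LibriTTS-test-clean": 7.4502, "VCTK": 6.5243},
}


def relative_reduction(reference, model, dataset):
    """Percentage by which `model` lowers the mean DTW-MCD of `reference` on `dataset`."""
    ref = DTW_MCD[reference][dataset]
    return (ref - DTW_MCD[model][dataset]) / ref * 100.0


for dataset in ("LibriTTS-test-clean", "VCTK"):
    for reference in ("Baseline", "Meta-TTS"):
        pct = relative_reduction(reference, "Meta-adaption", dataset)
        print(f"{dataset}: Meta-adaption vs {reference}: {pct:.2f}% lower mean DTW-MCD")
```

Relative to the baseline, the reduction in the mean is roughly 3% on LibriTTS-test-clean and roughly 7% on VCTK.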

Tab. 3 Subjective evaluation results on the LibriTTS-test-clean and VCTK datasets

Model	LibriTTS-test-clean, MOS	LibriTTS-test-clean, SMOS	VCTK, MOS	VCTK, SMOS
Baseline	2.708±0.250	2.783±0.266	3.067±0.229	2.767±0.335
Meta-TTS	3.058±0.215	3.200±0.252	3.217±0.170	2.950±0.293
Meta-adaption	3.433±0.199	3.508±0.237	3.392±0.181	3.075±0.335

