Journal of Computer Applications, 2024, Vol. 44, Issue (5): 1629-1635. DOI: 10.11772/j.issn.1001-9081.2023050640
Special Issue: Multimedia computing and computer simulation
Zhihao WU, Ziqiu CHI, Ting XIAO, Zhe WANG
Received: 2023-05-24
Revised: 2023-09-07
Accepted: 2023-09-13
Online: 2023-09-26
Published: 2024-05-10
Contact: Zhe WANG
About author: WU Zhihao, born in 1999, from Cixi, Zhejiang, M. S. candidate. His research interests include deep learning and few-shot learning.
Zhihao WU, Ziqiu CHI, Ting XIAO, Zhe WANG. Meta-learning adaption for few-shot text-to-speech[J]. Journal of Computer Applications, 2024, 44(5): 1629-1635.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2023050640
Step | With step encoder, LibriTTS-test-clean | With step encoder, VCTK | Without step encoder, LibriTTS-test-clean | Without step encoder, VCTK |
---|---|---|---|---|
1 | 0.6530 | 0.5355 | 0.6161 | 0.5898 |
5 | 0.7860 | 0.6583 | 0.6874 | 0.6180 |
10 | 0.8222 | 0.6891 | 0.7147 | 0.6148 |
20 | 0.8508 | 0.7070 | 0.7385 | 0.6277 |
50 | 0.8672 | 0.7156 | 0.7765 | 0.6452 |
100 | 0.8724 | 0.7132 | 0.7988 | 0.6649 |
Tab. 1 Speaker similarity matrices with or without step encoder
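To make the comparison in Tab. 1 easier to read, the sketch below computes the per-step difference in speaker similarity between the two settings. The numbers are copied from the table; the names `TAB1` and `step_encoder_gain` are illustrative and do not come from the paper.

```python
# Speaker similarity values copied from Tab. 1.
# step -> (with encoder LibriTTS, with encoder VCTK,
#          without encoder LibriTTS, without encoder VCTK)
TAB1 = {
    1: (0.6530, 0.5355, 0.6161, 0.5898),
    5: (0.7860, 0.6583, 0.6874, 0.6180),
    10: (0.8222, 0.6891, 0.7147, 0.6148),
    20: (0.8508, 0.7070, 0.7385, 0.6277),
    50: (0.8672, 0.7156, 0.7765, 0.6452),
    100: (0.8724, 0.7132, 0.7988, 0.6649),
}


def step_encoder_gain(table):
    """Return {step: (gain on LibriTTS-test-clean, gain on VCTK)}.

    A positive gain means the setting with the step encoder scored higher.
    """
    gains = {}
    for step, (w_libri, w_vctk, wo_libri, wo_vctk) in table.items():
        gains[step] = (round(w_libri - wo_libri, 4), round(w_vctk - wo_vctk, 4))
    return gains


if __name__ == "__main__":
    for step, (g_libri, g_vctk) in step_encoder_gain(TAB1).items():
        print(f"step {step:>3}: LibriTTS {g_libri:+.4f}, VCTK {g_vctk:+.4f}")
```

On these values the only negative difference is VCTK at step 1 (-0.0543); at every other step and on both datasets the setting with the step encoder scores higher.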
Model | LibriTTS-test-clean | VCTK |
---|---|---|
Baseline | 7.6772±0.9385 | 7.0229±1.0263 |
Meta-TTS | 7.5714±0.9218 | 6.8025±1.0035 |
Meta-adaption | 7.4502±0.8611 | 6.5243±0.9053 |
Tab. 2 Evaluation results of DTW-MCD after 100 adaption steps
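For readers unfamiliar with the metric in Tab. 2, the following is a minimal sketch of how mel-cepstral distortion with dynamic time warping (DTW-MCD) is commonly computed. It is not the authors' evaluation code. It assumes each utterance is already a list of mel-cepstral frames, uses the usual per-frame formula (10 / ln 10) · sqrt(2 · Σ (c_d − ĉ_d)²), and averages over the DTW path length; the cepstral order, the handling of the 0th coefficient and the averaging convention are not given on this page and may differ in the paper. The function names are illustrative.

```python
import math

# Constant factor of the usual MCD formula, in dB.
MCD_CONST = 10.0 / math.log(10.0) * math.sqrt(2.0)


def frame_mcd(a, b):
    """MCD in dB between two mel-cepstral frames of equal length."""
    return MCD_CONST * math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_mcd(ref, syn):
    """Average per-frame MCD along the minimum-cost DTW alignment path."""
    n, m = len(ref), len(syn)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # cost[i][j]: minimal accumulated distance aligning ref[:i] with syn[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    # steps[i][j]: number of aligned frame pairs on that minimal path
    steps = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_mcd(ref[i - 1], syn[j - 1])
            # Choose the cheapest predecessor: diagonal, up, or left.
            prev_cost, prev_steps = min(
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
            )
            cost[i][j] = prev_cost + d
            steps[i][j] = prev_steps + 1
    return cost[n][m] / steps[n][m]


if __name__ == "__main__":
    # Toy frames for illustration only, not real speech features.
    reference = [[0.0, 0.1], [0.5, 0.4], [1.0, 0.9]]
    synthesized = [[0.0, 0.1], [0.0, 0.1], [0.5, 0.5], [1.0, 0.9]]
    print(f"DTW-MCD: {dtw_mcd(reference, synthesized):.4f} dB")
```

Lower values indicate that the synthesized cepstra are closer to the reference, which is the direction of improvement in Tab. 2.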
Model | LibriTTS-test-clean, MOS | LibriTTS-test-clean, SMOS | VCTK, MOS | VCTK, SMOS |
---|---|---|---|---|
Baseline | 2.708±0.250 | 2.783±0.266 | 3.067±0.229 | 2.767±0.335 |
Meta-TTS | 3.058±0.215 | 3.200±0.252 | 3.217±0.170 | 2.950±0.293 |
Meta-adaption | 3.433±0.199 | 3.508±0.237 | 3.392±0.181 | 3.075±0.335 |
Tab. 3 Subjective evaluation results on LibriTTS-test-clean and VCTK datasets
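This page does not say how the ± terms in Tab. 3 were obtained. One common convention in listening tests is to report the mean opinion score with a 95% confidence interval. The sketch below shows that convention under a normal approximation, purely as an assumption; the function name `mos_with_ci` and the ratings in the usage example are hypothetical and are not data from the paper.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Return (mean, half_width) of a normal-approximation confidence interval.

    z=1.96 corresponds to a 95% interval; this choice is an assumption,
    not something stated in the source.
    """
    if len(scores) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(scores)
    # Standard error of the mean from the sample standard deviation.
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


if __name__ == "__main__":
    # Hypothetical listener ratings on a 1-5 scale, for illustration only.
    ratings = [3, 4, 3, 4, 5, 3, 4, 2]
    mean, hw = mos_with_ci(ratings)
    print(f"MOS = {mean:.3f}±{hw:.3f}")
```

The same calculation applies to SMOS, which rates speaker similarity rather than naturalness.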