| 1 | ARIK S Ö, DIAMOS G, GIBIANSKY A, et al. Deep Voice 2: multi-speaker neural text-to-speech[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 2966-2974. | 
																													
| 2 | CHEN M, TAN X, LI B, et al. AdaSpeech: adaptive text to speech for custom voice[C/OL]// Proceedings of the 9th International Conference on Learning Representations. [S.l.]: dblp, 2021 [2023-04-11]. DOI: 10.48550/arXiv.2103.00993 | 
																													
| 3 | WANG T, TAO J, FU R, et al. Spoken content and voice factorization for few-shot speaker adaptation[C]// Proceedings of the 21st Annual Conference of the International Speech Communication Association. Baixas, France: International Speech Communication Association, 2020: 796-800. DOI: 10.21437/interspeech.2020-1745 | 
																													
| 4 | ARIK S Ö, CHEN J, PENG K, et al. Neural voice cloning with a few samples[C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2018: 10040-10050. | 
																													
| 5 | CHOI S, HAN S, KIM D, et al. Attentron: few-shot text-to-speech utilizing attention-based variable-length embedding[C]// Proceedings of the 21st Annual Conference of the International Speech Communication Association. Baixas, France: International Speech Communication Association, 2020: 2007-2011. DOI: 10.21437/interspeech.2020-2096 | 
																													
| 6 | CHIEN C-M, LIN J-H, HUANG C-Y, et al. Investigating on incorporating pretrained and learnable speaker representations for multi-speaker multi-style text-to-speech[C]// Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2021: 8588-8592. DOI: 10.1109/icassp39728.2021.9413880 | 
																													
| 7 | CAI Z, ZHANG C, LI M. From speaker verification to multi-speaker speech synthesis, deep transfer with feedback constraint[C]// Proceedings of the 21st Annual Conference of the International Speech Communication Association. Baixas, France: International Speech Communication Association, 2020: 3974-3978. DOI: 10.21437/interspeech.2020-1032 | 
																													
| 8 | RAZAVI A, VAN DEN OORD A, VINYALS O. Generating diverse high-fidelity images with VQ-VAE-2[C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 14866-14876. | 
																													
| 9 | RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[C]// Proceedings of the 2015 International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer, 2015: 234-241. DOI: 10.1007/978-3-319-24574-4_28 | 
																													
| 10 | WANG T, TAO J, FU R, et al. Bi-level speaker supervision for one-shot speech synthesis[C]// Proceedings of the 21st Annual Conference of the International Speech Communication Association. Baixas, France: International Speech Communication Association, 2020: 3989-3993. DOI: 10.21437/interspeech.2020-1737 | 
																													
| 11 | HUYBRECHTS G, MERRITT T, COMINI G, et al. Low-resource expressive text-to-speech using data augmentation[C]// Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2021: 6593-6597. DOI: 10.1109/icassp39728.2021.9413466 | 
																													
| 12 | HUANG S-F, LIN C-J, LIU D-R, et al. Meta-TTS: meta-learning for few-shot speaker adaptive text-to-speech[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, 30: 1558-1571. DOI: 10.1109/taslp.2022.3167258 | 
																													
| 13 | VAN DEN OORD A, DIELEMAN S, ZEN H, et al. WaveNet: a generative model for raw audio[C/OL]// Proceedings of the 9th ISCA Speech Synthesis Workshop. [S.l.]: ISCA, 2016 [2023-05-01]. DOI: 10.21437/ssw.2016 | 
																													
| 14 | WANG Y, SKERRY-RYAN R J, STANTON D, et al. Tacotron: towards end-to-end speech synthesis[C]// Proceedings of the 18th Annual Conference of the International Speech Communication Association. Baixas, France: International Speech Communication Association, 2017: 4006-4010. DOI: 10.21437/interspeech.2017-1452 | 
																													
| 15 | SKERRY-RYAN R J, BATTENBERG E, XIAO Y, et al. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron[C/OL]// Proceedings of the 35th International Conference on Machine Learning. [S.l.]: ICML, 2018 [2023-05-01]. | 
																													
																						| 16 | REN Y, RUAN Y, TAN X, et al. FastSpeech: fast, robust and controllable text to speech[C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 3171-3180. | 
																													
| 17 | REN Y, HU C, TAN X, et al. FastSpeech 2: fast and high-quality end-to-end text-to-speech[C/OL]// Proceedings of the 9th International Conference on Learning Representations. [S.l.]: ICLR, 2021 [2023-05-01]. | 
																													
| 18 | VINYALS O, BLUNDELL C, LILLICRAP T, et al. Matching networks for one shot learning[C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2016: 3637-3645. | 
																													
																						| 19 | SNELL J, SWERSKY K, ZEMEL R. Prototypical networks for few-shot learning[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 4080-4090. | 
																													
| 20 | ORESHKIN B N, RODRIGUEZ P, LACOSTE A. TADAM: task dependent adaptive metric for improved few-shot learning[C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2018: 719-729. | 
																													
																						| 21 | REZENDE D J, MOHAMED S, DANIHELKA I, et al. One-shot generalization in deep generative models[C]// Proceedings of the 33rd International Conference on Machine Learning. New York: JMLR, 2016: 1521-1529. | 
																													
																						| 22 | BARTUNOV S, VETROV D. Few-shot generative modelling with generative matching networks[C]// Proceedings of the 21st International Conference on Artificial Intelligence and Statistics. New York: JMLR, 2018: 670-678. | 
																													
| 23 | REED S, CHEN Y, PAINE T, et al. Few-shot autoregressive density estimation: towards learning to learn distributions[C/OL]// Proceedings of the 6th International Conference on Learning Representations. [S.l.]: ICLR, 2018 [2023-05-01]. | 
																													
| 24 | CHEN Y, ASSAEL Y, SHILLINGFORD B, et al. Sample efficient adaptive text-to-speech[C/OL]// Proceedings of the 7th International Conference on Learning Representations. [S.l.]: ICLR, 2019 [2023-05-01]. | 
																													
| 25 | HU Q, MARCHI E, WINARSKY D, et al. Neural text-to-speech adaptation from low quality public recordings[C]// Proceedings of the 10th ISCA Speech Synthesis Workshop. Baixas, France: International Speech Communication Association, 2019: 24-28. DOI: 10.21437/ssw.2019-5 | 
																													
																						| 26 | KONG J, KIM J, BAE J. HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 17022-17033. |