[1] FUNG P, SCHULTZ T. Multilingual spoken language processing[J]. IEEE Signal Processing Magazine, 2008, 25(3): 89-97.
[2] HUNT A J, BLACK A W. Unit selection in a concatenative speech synthesis system using a large speech database[C]// Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway, NJ: IEEE, 1996: 373-376.
[3] CAMPBELL N, BLACK A W. Prosody and the selection of source units for concatenative synthesis[M]// Progress in Speech Synthesis. New York: Springer, 1997: 279-292.
[4] ZEN H, SENIOR A, SCHUSTER M. Statistical parametric speech synthesis using deep neural networks[C]// Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2013: 7962-7966.
[5] TOKUDA K, NANKAKU Y, TODA T, et al. Speech synthesis based on hidden Markov models[J]. Proceedings of the IEEE, 2013, 101(5): 1234-1252.
[6] ZEN H, TOKUDA K, BLACK A W. Statistical parametric speech synthesis[J]. Speech Communication, 2009, 51(11): 1039-1064.
[7] OORD A V D, DIELEMAN S, ZEN H, et al. WaveNet: a generative model for raw audio[J/OL]. arXiv Preprint, 2016: arXiv:1609.03499 (2016-09-12) [2016-09-19]. https://arxiv.org/abs/1609.03499.
[8] ARIK S O, CHRZANOWSKI M, COATES A, et al. Deep Voice: real-time neural text-to-speech[J/OL]. arXiv Preprint, 2017: arXiv:1702.07825 (2017-02-25) [2017-03-07]. https://arxiv.org/abs/1702.07825.
[9] SOTELO J, MEHRI S, KUMAR K, et al. Char2Wav: end-to-end speech synthesis[EB/OL]. [2018-06-20]. http://mila.umontreal.ca/wp-content/uploads/2017/02/end-end-speech.pdf.
[10] WANG Y, SKERRY-RYAN R, STANTON D, et al. Tacotron: towards end-to-end speech synthesis[J/OL]. arXiv Preprint, 2017: arXiv:1703.10135 (2017-03-29) [2017-04-06]. https://arxiv.org/abs/1703.10135.
[11] GRIFFIN D, LIM J S. Signal estimation from modified short-time Fourier transform[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(2): 236-243.
[12] CHOROWSKI J K, BAHDANAU D, SERDYUK D, et al. Attention-based models for speech recognition[C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2015: 577-585.
[13] BAHDANAU D, CHOROWSKI J, SERDYUK D, et al. End-to-end attention-based large vocabulary speech recognition[C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2016: 4945-4949.
[14] CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition[C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2016: 4960-4964.
[15] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 3156-3164.
[16] VINYALS O, KAISER L, KOO T, et al. Grammar as a foreign language[C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2015: 2773-2781.
[17] LEE J, CHO K, HOFMANN T. Fully character-level neural machine translation without explicit segmentation[J/OL]. arXiv Preprint, 2017: arXiv:1610.03017 (2016-10-10) [2017-05-13]. https://arxiv.org/abs/1610.03017.
[18] SRIVASTAVA R K, GREFF K, SCHMIDHUBER J. Highway networks[J/OL]. arXiv Preprint, 2015: arXiv:1505.00387 (2015-05-03) [2015-11-03]. https://arxiv.org/abs/1505.00387.
[19] ERRO D, SAINZ I, NAVAS E, et al. Harmonics plus noise model based vocoder for statistical parametric speech synthesis[J]. IEEE Journal of Selected Topics in Signal Processing, 2014, 8(2): 184-194.
[20] AOKI N. Development of a rule-based speech synthesis system for the Japanese language using a MELP vocoder[C]// Proceedings of the 2000 10th European Signal Processing Conference. Piscataway, NJ: IEEE, 2000: 1-4.
[21] GUNDUZHAN E, MOMTAHAN K. Linear prediction based packet loss concealment algorithm for PCM coded speech[J]. IEEE Transactions on Speech and Audio Processing, 2001, 9(8): 778-785.