[1] FUNG P, SCHULTZ T. Multilingual spoken language processing[J]. IEEE Signal Processing Magazine, 2008, 25(3): 89-97.
[2] HUNT A J, BLACK A W. Unit selection in a concatenative speech synthesis system using a large speech database[C]// Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway, NJ: IEEE, 1996: 373-376.
[3] CAMPBELL N, BLACK A W. Prosody and the selection of source units for concatenative synthesis[M]// Progress in Speech Synthesis. New York: Springer, 1997: 279-292.
[4] ZEN H, SENIOR A, SCHUSTER M. Statistical parametric speech synthesis using deep neural networks[C]// Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2013: 7962-7966.
[5] TOKUDA K, NANKAKU Y, TODA T, et al. Speech synthesis based on hidden Markov models[J]. Proceedings of the IEEE, 2013, 101(5): 1234-1252.
[6] ZEN H, TOKUDA K, BLACK A W. Statistical parametric speech synthesis[J]. Speech Communication, 2009, 51(11): 1039-1064.
[7] OORD A V D, DIELEMAN S, ZEN H, et al. WaveNet: a generative model for raw audio[J/OL]. arXiv Preprint, 2016: arXiv:1609.03499 (2016-09-12) [2016-09-19]. https://arxiv.org/abs/1609.03499.
[8] ARIK S O, CHRZANOWSKI M, COATES A, et al. Deep Voice: real-time neural text-to-speech[J/OL]. arXiv Preprint, 2017: arXiv:1702.07825 (2017-02-25) [2017-03-07]. https://arxiv.org/abs/1702.07825.
[9] SOTELO J, MEHRI S, KUMAR K, et al. Char2Wav: end-to-end speech synthesis[EB/OL]. [2018-06-20]. http://mila.umontreal.ca/wp-content/uploads/2017/02/end-end-speech.pdf.
[10] WANG Y, SKERRY-RYAN R, STANTON D, et al. Tacotron: towards end-to-end speech synthesis[J/OL]. arXiv Preprint, 2017: arXiv:1703.10135 (2017-03-29) [2017-04-06]. https://arxiv.org/abs/1703.10135.
[11] GRIFFIN D, LIM J S. Signal estimation from modified short-time Fourier transform[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, 32(2): 236-243.
[12] CHOROWSKI J K, BAHDANAU D, SERDYUK D, et al. Attention-based models for speech recognition[C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2015: 577-585.
[13] BAHDANAU D, CHOROWSKI J, SERDYUK D, et al. End-to-end attention-based large vocabulary speech recognition[C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2016: 4945-4949.
[14] CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition[C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE, 2016: 4960-4964.
[15] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 3156-3164.
[16] VINYALS O, KAISER L, KOO T, et al. Grammar as a foreign language[C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2015: 2773-2781.
[17] LEE J, CHO K, HOFMANN T. Fully character-level neural machine translation without explicit segmentation[J/OL]. arXiv Preprint, 2017: arXiv:1610.03017 (2016-10-10) [2017-05-13]. https://arxiv.org/abs/1610.03017.
[18] SRIVASTAVA R K, GREFF K, SCHMIDHUBER J. Highway networks[J/OL]. arXiv Preprint, 2015: arXiv:1505.00387 (2015-05-03) [2015-11-03]. https://arxiv.org/abs/1505.00387.
[19] ERRO D, SAINZ I, NAVAS E, et al. Harmonics plus noise model based vocoder for statistical parametric speech synthesis[J]. IEEE Journal of Selected Topics in Signal Processing, 2014, 8(2): 184-194.
[20] AOKI N. Development of a rule-based speech synthesis system for the Japanese language using a MELP vocoder[C]// Proceedings of the 2000 10th European Signal Processing Conference. Piscataway, NJ: IEEE, 2000: 1-4.
[21] GUNDUZHAN E, MOMTAHAN K. Linear prediction based packet loss concealment algorithm for PCM coded speech[J]. IEEE Transactions on Speech and Audio Processing, 2001, 9(8): 778-785.