Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (5): 1325-1329. DOI: 10.11772/j.issn.1001-9081.2018102131

• Artificial intelligence •

End-to-end speech synthesis based on WaveNet

QIU Zeyu, QU Dan, ZHANG Lianhai

  1. College of Information Systems Engineering, PLA Strategic Support Force Information Engineering University, Zhengzhou, Henan 450000, China
  • Received: 2018-10-23 Revised: 2018-12-12 Online: 2019-05-10 Published: 2019-05-14
  • Corresponding author: QIU Zeyu
  • About the authors: QIU Zeyu (born 1995), male, from Zhumadian, Henan, M.S. candidate; research interests: intelligent information processing, speech synthesis. QU Dan (born 1974), female, from Changchun, Jilin, associate professor, Ph.D.; research interests: speech signal processing, intelligent information processing, artificial intelligence, signal analysis. ZHANG Lianhai (born 1971), male, from Heze, Shandong, associate professor, M.S.; research interests: speech signal processing, intelligent information processing, artificial intelligence, signal analysis.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61673395).

Abstract: End-to-end speech synthesis systems commonly use the Griffin-Lim algorithm to recover phase information, which yields speech with low fidelity and obvious artifacts. To address this problem, an end-to-end speech synthesis method based on the WaveNet network architecture was proposed. Built on a Sequence-to-Sequence (Seq2Seq) structure, the method first converts the input text into one-hot vectors, then introduces an attention mechanism to obtain a Mel spectrogram, and finally uses a WaveNet back-end network to reconstruct the phase information of the speech signal, thereby inverting the Mel spectrogram features into time-domain waveform samples. Experiments were carried out for English and Chinese on the LJSpeech-1.0 and THchs-30 corpora, where the proposed method achieved Mean Opinion Scores (MOS) of 3.31 and 3.02, respectively. In terms of naturalness, it outperformed both end-to-end systems based on the Griffin-Lim algorithm and parametric speech synthesis systems.
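The first stage named in the abstract, converting input text into one-hot vectors, can be illustrated with a minimal sketch. The character-level vocabulary, the helper names `build_vocab` and `one_hot_encode`, and the sample sentence are assumptions made for illustration; the abstract does not state which symbol set or tokenization the authors used.

```python
def build_vocab(texts):
    """Map every distinct character in the given texts to an integer index.

    Assumption: a character-level vocabulary; the paper's symbol set is not
    given in the abstract.
    """
    symbols = sorted({ch for text in texts for ch in text})
    return {ch: i for i, ch in enumerate(symbols)}


def one_hot_encode(text, vocab):
    """Return one row per character; each row has a single 1 at that character's index."""
    rows = []
    for ch in text:
        row = [0] * len(vocab)
        row[vocab[ch]] = 1
        rows.append(row)
    return rows


# Hypothetical sample text, not taken from LJSpeech-1.0 or THchs-30.
vocab = build_vocab(["hello world"])
encoded = one_hot_encode("hello", vocab)
```

Each row of `encoded` would then serve as one input step of a Seq2Seq encoder. In practice such rows are usually replaced by an embedding lookup, which is mathematically equivalent to multiplying the one-hot row by an embedding matrix.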

Key words: speech synthesis, end-to-end, Sequence-to-Sequence (Seq2Seq), Griffin-Lim algorithm, WaveNet
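For readers unfamiliar with the Mean Opinion Score reported in the abstract, the sketch below shows how such a score is typically computed: listeners rate each utterance on a 1 to 5 scale and the ratings are averaged. The ratings here are invented for illustration and are not the paper's data. The 95% confidence half-width (normal approximation) is an addition that the abstract does not mention, and `mean_opinion_score` is a hypothetical helper name.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% confidence half-width) for ratings on a 1-5 scale.

    The half-width uses a normal approximation (1.96 * standard error); this
    is an assumption, not something the abstract reports.
    """
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    mos = statistics.mean(ratings)
    if len(ratings) < 2:
        return mos, 0.0
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings, not the paper's data.
ratings = [3, 4, 3, 3, 4, 3, 2, 4]
mos, ci = mean_opinion_score(ratings)
```

With real listening tests, the same computation would be run per system (for example, the WaveNet-based system and a Griffin-Lim baseline) so that the resulting scores can be compared.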

CLC number: