Journal of Computer Applications, 2020, Vol. 40, Issue (7): 2137-2141. DOI: 10.11772/j.issn.1001-9081.2019111968

• Virtual reality and multimedia computing •

Speech separation algorithm based on convolutional encoder decoder and gated recurrent unit

CHEN Xiukai, LU Zhihua, ZHOU Yu

  1. College of Information Science and Engineering, Ningbo University, Ningbo Zhejiang 315211, China
  • Received: 2019-11-19 Revised: 2020-03-10 Online: 2020-07-10 Published: 2020-05-19
  • Corresponding author: ZHOU Yu
  • About the authors: CHEN Xiukai (born 1994), male, from Ma'anshan, Anhui, M.S. candidate; research interests: speech signal processing, indoor sound source localization. LU Zhihua (born 1983), male, from Jinhua, Zhejiang, associate professor, Ph.D.; research interests: speech signal processing, real-time tracking of multiple moving targets. ZHOU Yu (born 1960), male, from Weihai, Shandong, professor, M.S.; research interests: signal processing, network and information security.
  • Supported by:
    This work is partially supported by the Youth Program of the National Natural Science Foundation of China (61801255).

Abstract: Most deep-learning-based speech separation and speech enhancement algorithms use spectral features obtained by the Fourier transform as the input to the neural network and do not take the phase information of the speech signal into account. However, previous studies have shown that phase information is essential for improving speech quality, especially at low Signal-to-Noise Ratio (SNR). To address this problem, a speech separation algorithm based on a Convolutional Encoder Decoder network and a Gated Recurrent Unit network (CED-GRU) was proposed. Firstly, because the raw waveform contains both amplitude and phase information, the raw waveform of the mixed speech signal was used as the input feature. Secondly, the Convolutional Encoder Decoder (CED) network was combined with the Gated Recurrent Unit (GRU) network to handle the temporal dependencies in the speech signal. Compared with the Permutation Invariant Training (PIT), Deep Clustering (DC), and Deep Attractor Network (DAN) algorithms, the proposed algorithm improved the Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) for male-male, male-female, and female-female mixtures by 1.16 and 0.29, 1.37 and 0.27, 1.08 and 0.3; 0.87 and 0.21, 1.11 and 0.22, 0.81 and 0.24; 0.64 and 0.24, 1.01 and 0.34, 0.73 and 0.29 percentage points, respectively. The experimental results show that the speech separation system based on CED-GRU has considerable value in practical applications.

Key words: Convolutional Neural Network (CNN), Convolutional Encoder Decoder (CED), Gated Recurrent Unit (GRU), end-to-end, speech separation
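
The abstract's motivation, that a magnitude spectrum discards phase while the raw waveform keeps it, can be illustrated with a small sketch. This is not the authors' code: it is a toy example that assumes two synthetic 16-sample frames and a naive discrete Fourier transform, and the names `dft`, `magnitude_spectrum`, `frame_a`, and `frame_b` are illustrative only.

```python
import cmath
import math


def dft(signal):
    """Naive O(N^2) discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def magnitude_spectrum(signal):
    """Magnitude of each DFT bin; the phase of each bin is thrown away."""
    return [abs(c) for c in dft(signal)]


N = 16  # assumed toy frame length, not a value from the paper

# Two synthetic frames with the same frequency content but different phase.
frame_a = [math.cos(2 * math.pi * 2 * t / N) for t in range(N)]
frame_b = [math.sin(2 * math.pi * 2 * t / N) for t in range(N)]

mag_a = magnitude_spectrum(frame_a)
mag_b = magnitude_spectrum(frame_b)

# The magnitude spectra agree bin by bin, yet the waveforms differ.
same_magnitude = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(mag_a, mag_b)
)
same_waveform = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(frame_a, frame_b)
)

print("identical magnitude spectra:", same_magnitude)
print("identical waveforms:", same_waveform)
```

A network fed only `mag_a` or `mag_b` could not tell the two frames apart, whereas a network fed the raw samples can, which is the rationale the abstract gives for using the waveform as the input feature.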

CLC Number: