Speech separation algorithm based on convolutional encoder decoder and gated recurrent unit

doi:10.11772/j.issn.1001-9081.2019111968

Abstract

Abstract: In most speech separation and speech enhancement algorithms based on deep learning, the spectrum feature after Fourier transform is used as the input feature of the neural network, without considering the phase information in the speech signal. However, some previous studies show that phase information is essential to improve speech quality, especially at low Signal-to-Noise Ratio (SNR). To solve this problem, a speech separation algorithm based on Convolutional Encoder Decoder network and Gated Recurrent Unit (CED-GRU) network was proposed. Firstly, based on the characteristic that the original waveform contains both amplitude information and phase information, the original waveform of the mixed speech signal was used as the input feature. Secondly, the timing problem in speech signal was able to be effectively solved by combining the Convolutional Encoder Decoder (CED) network and the Gated Recurrent Unit (GRU) network. Compared with Permutation Invariant Training (PIT) algorithm, DC (Deep Clustering) algorithm, Deep Attractor Network (DAN) algorithm, the improved algorithm has the Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) of men and men, men and women, women and women increased by 1.16 and 0.29, 1.37 and 0.27, 1.08 and 0.3; 0.87 and 0.21, 1.11 and 0.22, 0.81 and 0.24; 0.64 and 0.24, 1.01 and 0.34, 0.73 and 0.29 percentage points. The experimental results show that the speech separation system based on CED-GRU has great value in practical application.

Key words: Convolutional Neural Network (CNN), Convolutional Encoder Decoder (CED), Gated Recurrent Unit (GRU), end-to-end, speech separation

摘要： 在大部分基于深度学习的语音分离和语音增强算法中，把傅里叶变换后的频谱特征作为神经网络的输入特征，并未考虑到语音信号中的相位信息。然而过去的一些研究表明，尤其是在低信噪比（SNR）条件下，相位信息对于提高语音质量是必不可少的。针对这个问题，提出了一种基于卷积编解码器网络和门控循环单元（CED-GRU）的语音分离算法。首先，利用原始波形既包含幅值信息也包含相位信息的特点，在输入端以混合语音信号的原始波形作为输入特征；其次，通过结合卷积编解码器（CED）网络和门控循环单元（GRU）网络，可以有效解决语音信号中存在的时序问题。提出的改进算法在男性和男性、男性和女性、女性和女性的语音质量的感知评价（PESQ）和短时目标可懂度（STOI）方面，与基于排列不变训练（PIT）算法、基于深度聚类（DC）算法、基于深度吸引网络（DAN）算法相比，分别提高了1.16和0.29、1.37和0.27、1.08和0.3；0.87和0.21、1.11和0.22、0.81和0.24；0.64和0.24、1.01和0.34、0.73和0.29个百分点。实验结果表明，基于CED-GRU的语音分离系统在实际应用中具有较大的价值。

关键词: 卷积神经网络, 卷积编解码器, 门控循环单元, 端到端, 语音分离

CLC Number:

TN912.3

CHEN Xiukai, LU Zhihua, ZHOU Yu. Speech separation algorithm based on convolutional encoder decoder and gated recurrent unit[J]. Journal of Computer Applications, 2020, 40(7): 2137-2141.

陈修凯, 陆志华, 周宇. 基于卷积编解码器和门控循环单元的语音分离算法[J]. 计算机应用, 2020, 40(7): 2137-2141.

References

[1] CHERRY E C. Some experiments on the recognition of speech, with one and with two ears[J]. The Journal of the Acoustical Society of America,1953,25(5):975-979.
[2] CHEERY E C. On Human Communication[M]. Cambridge:MIT Press,1957:15-18.
[3] HUANG P S,KIM M,HASEGAWA-JOHNSON M,et al. Joint op-timization of masks and deep recurrent neural networks for monaural source separation[J]. IEEE/ACM Transactions on Audio,Speech, and Language Processing,2015,23(12):2136-2147.
[4] ZHANG X,WANG D. A deep ensemble learning method for mon-aural speech separation[J]. IEEE/ACM Transactions on Audio, Speech,and Language Processing,2016,24(5):967-977.
[5] LUO Y,CHEN Z,MESGARANI N. Speaker-independent speech separation with deep attractor network[J]. IEEE/ACM Transactions on Audio,Speech,and Language Processing,2018,26(4):787-796.
[6] ZHAN G,HUANG Z,YING D,et al. Improvement of mask-based speech source separation using DNN[C]//Proceedings of the 10th International Symposium on Chinese Spoken Language Processing. Piscataway:IEEE,2016:1-5.
[7] LI X,WU X,CHEN J. A spectral-change-aware loss function for DNN-based speech separation[C]//Proceedings of the 2019 IEEE International Conference on Acoustics,Speech and Signal Process-ing. Piscataway:IEEE,2019:6870-6874.
[8] SUN Y,XIAN Y,WANG W,et al. Monaural source separation in complex domain with long short-term memory neural network[J]. IEEE Journal of Selected Topics in Signal Processing,2019,13(2):359-369.
[9] PALIWAL K,WÓJCICKI K,SHANNON B. The importance of phase in speech enhancement[J]. Speech Communication,2011, 53(4):465-494.
[10] PASCUAL S,BONAFONTE A,SERRÀ J. SEGAN:speech en-hancement generative adversarial network[C]//Proceedings of the 2017 IEEE International Conference on Acoustics,Speech and Sig-nal Processing. Piscataway:IEEE,2017:3642-3646.
[11] TAN K,WANG D. A convolutional recurrent neural network for real-time speech enhancement[C]//Proceedings of the 2018 IEEE International Conference on Acoustics,Speech and Signal Process-ing. Piscataway:IEEE,2018:3229-3233.
[12] CHO K,MERRIËNBOER B V,GULCEHRE C,et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA:Association for Computational Linguistics,2014:1724-1734.
[13] 范存航, 刘斌, 陶建华, 等. 一种基于卷积神经网络的端到端语音分离方法[J]. 信号处理,2019,35(4):542-548.(FAN C H, LIU B,TAO J H,et al. An end-to-end speech separation method based on convolutional neural network[J]. Journal of Signal Pro-cessing,2019,35(4):542-548.)
[14] LUO Y,MESGARANI N. TasNet:time-domain audio separation network for real-time single-channel speech separation[C]//Pro-ceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway:IEEE,2018:696-700.
[15] 李娟娟, 王丹, 李子晋. 基于深层声学特征的端到端语音分离[J]. 计算机系统应用,2019,28(10):1-7. (LI J J,WANG D, LI Z J. End-to-end speech separation based on deep acoustic fea-ture[J]. Computer Systems and Applications,2019,28(10):1-7.
[16] GAROFOLO J S,LAMEL L F,FISHER W M,et al. DARPA TI-MIT acoustic-phonetic continuous speech corpus CD-ROM:NIST speech disc 1-1.1[R]. Gaithersburg,MD:National Institute of Standards and Technology,1993.
[17] RIX A W,BEERENDS J G,HOLLIER M P,et al. Perceptual Evaluation of Speech Quality(PESQ)-a new method for speech quality assessment of telephone networks and codecs[C]//Proceed-ings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway:IEEE, 2001:749-752.
[18] TAAL C H,HENDRIKS R C,HEUSDENS R,et al. A short-time objective intelligibility measure for time-frequency weighted noisy speech[C]//Proceedings of the 2010 IEEE International Confer-ence on Acoustics,Speech,and Signal Processing. Piscataway:IEEE,2010:4214-4217.
[19] VINCENT E,GRIBONVAL R,FEVOTTE C. Performance mea-surement in blind audio source separation[J]. IEEE Transactions on Audio,Speech and Language Processing,2006,14(4):1462-1469.
[20] KOLBÆK M,YU D,TAN Z,et al. Multitalker speech separation with utterance-level permutation invariant training of deep recur-rent neural networks[J]. IEEE/ACM Transactions on Audio, Speech,and Language Processing,2017,25(10):1901-1913.
[21] HERSHEY J R,CHEN Z,LE ROUX J,et al. Deep clustering:discriminative embeddings for segmentation and separation[C]//Proceedings of the 2016 IEEE International Conference on Acous-tics,Speech,and Signal Processing. Piscataway:IEEE,2016:31-35.
[22] CHEN Z,LUO Y,MESGARANI N. Deep attractor network for single-microphone speaker separation[C]//Proceedings of the 2017 IEEE International Conference on Acoustics,Speech,and Signal Processing. Piscataway:IEEE,2017:246-250.