Abstract: Most deep-learning-based speech separation and speech enhancement algorithms take the magnitude spectrum obtained by the Fourier transform as the input feature of the neural network and ignore the phase information in the speech signal. However, previous studies show that phase information is essential for improving speech quality, especially at low Signal-to-Noise Ratio (SNR). To address this problem, a speech separation algorithm based on a Convolutional Encoder-Decoder network and a Gated Recurrent Unit network (CED-GRU) is proposed. First, because the raw waveform contains both magnitude and phase information, the raw waveform of the mixed speech signal is used as the input feature. Second, the temporal dependencies in the speech signal are modeled effectively by combining the Convolutional Encoder-Decoder (CED) network with the Gated Recurrent Unit (GRU) network. Compared with the Permutation Invariant Training (PIT), Deep Clustering (DC), and Deep Attractor Network (DAN) algorithms, the improved algorithm increases the Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) scores for male-male, male-female, and female-female mixtures by 1.16 and 0.29, 1.37 and 0.27, 1.08 and 0.30; 0.87 and 0.21, 1.11 and 0.22, 0.81 and 0.24; and 0.64 and 0.24, 1.01 and 0.34, 0.73 and 0.29 percentage points, respectively. The experimental results show that the speech separation system based on CED-GRU has great value in practical applications.
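To make the described pipeline concrete, the following is a minimal PyTorch sketch of a waveform-in/waveform-out CED-GRU separator of the kind the abstract outlines: strided 1-D convolutions encode the raw mixture (so magnitude and phase remain in the learned representation), a GRU models the temporal structure of the encoded sequence, and transposed convolutions decode one waveform per speaker. The class name `CEDGRUSeparator`, the layer counts, channel widths, kernel sizes, and strides are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn


class CEDGRUSeparator(nn.Module):
    """Sketch of a raw-waveform CED-GRU separator (hyperparameters assumed)."""

    def __init__(self, num_speakers=2, channels=64, hidden=128):
        super().__init__()
        # Convolutional encoder: downsample the raw waveform into a learned
        # representation that keeps both magnitude and phase information.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=8, stride=4, padding=2), nn.ReLU(),
        )
        # GRU models the temporal dependencies of the encoded sequence.
        self.gru = nn.GRU(channels, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, channels * num_speakers)
        # Transposed-convolution decoder: upsample back to the waveform
        # domain, one output channel per separated speaker.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(channels * num_speakers, channels * num_speakers,
                               kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(channels * num_speakers, num_speakers,
                               kernel_size=8, stride=4, padding=2),
        )

    def forward(self, mixture):
        # mixture: (batch, samples) raw waveform of the mixed speech signal
        x = self.encoder(mixture.unsqueeze(1))   # (batch, C, frames)
        h, _ = self.gru(x.transpose(1, 2))       # (batch, frames, H)
        x = self.proj(h).transpose(1, 2)         # (batch, C * num_speakers, frames)
        return self.decoder(x)                   # (batch, num_speakers, samples)


if __name__ == "__main__":
    model = CEDGRUSeparator()
    mix = torch.randn(4, 16000)                  # 1 s of 16 kHz audio per item
    print(model(mix).shape)                      # torch.Size([4, 2, 16000])
```

Because the model maps waveform to waveform end to end, no explicit Fourier transform is needed and phase does not have to be estimated separately, which is the motivation stated in the abstract.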