Abstract: Most deep-learning-based speech separation and speech enhancement algorithms take the magnitude spectrum obtained by the Fourier transform as the input feature of the neural network and ignore the phase information in the speech signal. However, previous studies show that phase information is essential for improving speech quality, especially at low Signal-to-Noise Ratio (SNR). To address this problem, a speech separation algorithm based on a Convolutional Encoder-Decoder network and a Gated Recurrent Unit network (CED-GRU) is proposed. First, because the raw waveform contains both magnitude and phase information, the raw waveform of the mixed speech signal is used as the input feature. Second, the temporal dependencies in the speech signal are modeled effectively by combining the Convolutional Encoder-Decoder (CED) network with the Gated Recurrent Unit (GRU) network. Compared with the Permutation Invariant Training (PIT), Deep Clustering (DC), and Deep Attractor Network (DAN) algorithms, the improved algorithm increases the Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) scores for male-male, male-female, and female-female mixtures by 1.16 and 0.29, 1.37 and 0.27, 1.08 and 0.30; 0.87 and 0.21, 1.11 and 0.22, 0.81 and 0.24; and 0.64 and 0.24, 1.01 and 0.34, 0.73 and 0.29 percentage points, respectively. The experimental results show that the speech separation system based on CED-GRU has great value in practical applications.
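To make the described pipeline concrete, the following is a minimal PyTorch sketch of a waveform-in/waveform-out CED-GRU separator of the kind the abstract outlines: strided 1-D convolutions encode the raw mixture (so magnitude and phase remain in the learned representation), a GRU models the temporal structure of the encoded sequence, and transposed convolutions decode one waveform per speaker. The class name `CEDGRUSeparator`, the layer counts, channel widths, kernel sizes, and strides are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn


class CEDGRUSeparator(nn.Module):
    """Sketch of a raw-waveform CED-GRU separator (hyperparameters assumed)."""

    def __init__(self, num_speakers=2, channels=64, hidden=128):
        super().__init__()
        # Convolutional encoder: downsample the raw waveform into a learned
        # representation that keeps both magnitude and phase information.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=8, stride=4, padding=2), nn.ReLU(),
        )
        # GRU models the temporal dependencies of the encoded sequence.
        self.gru = nn.GRU(channels, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, channels * num_speakers)
        # Transposed-convolution decoder: upsample back to the waveform
        # domain, one output channel per separated speaker.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(channels * num_speakers, channels * num_speakers,
                               kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(channels * num_speakers, num_speakers,
                               kernel_size=8, stride=4, padding=2),
        )

    def forward(self, mixture):
        # mixture: (batch, samples) raw waveform of the mixed speech signal
        x = self.encoder(mixture.unsqueeze(1))   # (batch, C, frames)
        h, _ = self.gru(x.transpose(1, 2))       # (batch, frames, H)
        x = self.proj(h).transpose(1, 2)         # (batch, C * num_speakers, frames)
        return self.decoder(x)                   # (batch, num_speakers, samples)


if __name__ == "__main__":
    model = CEDGRUSeparator()
    mix = torch.randn(4, 16000)                  # 1 s of 16 kHz audio per item
    print(model(mix).shape)                      # torch.Size([4, 2, 16000])
```

Because the model maps waveform to waveform end to end, no explicit Fourier transform is needed and phase does not have to be estimated separately, which is the motivation stated in the abstract.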