Abstract: Monaural speech enhancement algorithms obtain enhanced speech by estimating the noise components in noisy speech and suppressing them. However, over-estimation of the noise power, together with the errors introduced when compensating for that over-estimation, degrades the enhanced speech. To limit the distortion caused by noise over-estimation, a time-frequency mask estimation and optimization algorithm based on Computational Auditory Scene Analysis (CASA) was proposed. Firstly, the Decision-Directed (DD) algorithm was used to estimate the a priori Signal-to-Noise Ratio (SNR) and calculate the initial mask. Secondly, the Inter-Channel Correlation (ICC) factor between the noise and the noisy speech in each Gammatone filterbank channel was used to calculate the noise presence probability; a new noise estimate was obtained by combining this probability with the power spectrum of the noisy speech, which reduced the over-estimation in the initial noise estimate. Thirdly, the initial mask was iteratively refined by the optimization algorithm to reduce the error caused by noise over-estimation and to increase the target speech components in the mask; the new mask was obtained when the stopping conditions of the iteration were met. Finally, the enhanced speech was synthesized using the new, optimized mask. Experimental results demonstrate that speech enhanced with the new mask achieves higher Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) values than speech enhanced with the mask before optimization, indicating improved intelligibility and perceived listening quality.
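The first step above can be illustrated with a minimal sketch of the classical decision-directed recursion applied to one filterbank channel. It is not the authors' implementation: the function name `decision_directed_mask`, the smoothing factor `alpha = 0.98`, the floor `xi_min`, the first-frame initialization, and the Wiener-type ratio mask `xi / (1 + xi)` are all assumptions chosen for illustration, since the abstract does not specify them.

```python
def decision_directed_mask(noisy_power, noise_power, alpha=0.98, xi_min=1e-3):
    """Sketch of decision-directed a priori SNR estimation for one channel.

    noisy_power: per-frame power of the noisy speech in this channel.
    noise_power: per-frame estimated noise power in this channel.
    Returns (xi, mask): a priori SNR estimates and an initial ratio mask.
    All parameter values are illustrative assumptions.
    """
    xi_track, mask_track = [], []
    prev_clean_power = None  # estimated clean-speech power of the previous frame
    for y_pow, n_pow in zip(noisy_power, noise_power):
        n_pow = max(n_pow, 1e-12)            # guard against division by zero
        gamma = y_pow / n_pow                # a posteriori SNR
        ml_term = max(gamma - 1.0, 0.0)      # instantaneous SNR estimate
        if prev_clean_power is None:
            xi = ml_term                     # assumed first-frame initialization
        else:
            # Weighted sum of the previous frame's estimate and the current one
            xi = alpha * prev_clean_power / n_pow + (1.0 - alpha) * ml_term
        xi = max(xi, xi_min)                 # floor the a priori SNR
        gain = xi / (1.0 + xi)               # assumed Wiener-type ratio mask
        prev_clean_power = gain * gain * y_pow
        xi_track.append(xi)
        mask_track.append(gain)
    return xi_track, mask_track


# Toy channel: two noise-only frames followed by two frames containing speech.
xi, mask = decision_directed_mask([1.0, 1.0, 10.0, 10.0], [1.0, 1.0, 1.0, 1.0])
```

In this toy input the mask stays near zero for the noise-only frames and rises over the two speech frames, which reflects the smoothing behaviour of the recursion. If the noise power is over-estimated, `gamma` and hence the mask are biased low, which is the distortion the later re-estimation and mask optimization steps are meant to reduce.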