Review of speech segmentation and endpoint detection

doi:10.11772/j.issn.1001-9081.2019061071

Abstract

Speech segmentation is an essential preliminary task in speech recognition and speech synthesis, and its quality strongly affects the downstream system. Manual segmentation and labeling is highly accurate, but it is time-consuming and laborious and must be carried out by skilled domain experts. Automatic speech segmentation has therefore become a research hotspot in speech processing. This review first surveys current progress in automatic speech segmentation and explains the different ways in which segmentation methods are classified. It then introduces alignment-based methods and boundary detection-based methods in turn, and describes in detail the neural network segmentation methods that can be applied within both frameworks. Next, it presents newer segmentation techniques based on approaches such as biologically inspired signals and game theory, and gives the performance evaluation metrics widely used in the field, with a comparison and analysis of those metrics. Finally, it summarizes the above and proposes important directions for future research on speech segmentation.

Key words: speech segmentation, endpoint detection, speech synthesis, signal feature, Artificial Neural Network (ANN)
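
The abstract refers to performance evaluation metrics for segmentation without naming them here. As a minimal sketch of one commonly used family, and not necessarily the metrics the review compares, detected boundaries can be scored against reference boundaries within a tolerance window. The function name `boundary_scores`, the 20 ms default tolerance, and the greedy one-to-one matching are assumptions made for illustration.

```python
from typing import Dict, Sequence


def boundary_scores(
    reference: Sequence[float],
    detected: Sequence[float],
    tolerance: float = 0.02,  # assumed tolerance in seconds (20 ms)
) -> Dict[str, float]:
    """Score detected boundaries against reference boundaries (times in seconds).

    A detection counts as a hit if it lies within +/- tolerance of a
    reference boundary; each reference and each detection is used at most once.
    """
    ref = sorted(reference)
    det = sorted(detected)
    hits = 0
    i = j = 0
    # Walk both sorted lists once, pairing boundaries greedily in time order.
    while i < len(ref) and j < len(det):
        diff = det[j] - ref[i]
        if abs(diff) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # detection precedes every remaining reference: spurious
        else:
            i += 1  # reference has no detection close enough: missed
    precision = hits / len(det) if det else 0.0
    recall = hits / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"hits": hits, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = boundary_scores([0.50, 1.00, 1.50], [0.51, 1.20, 1.49, 1.80])
    print(scores)
```

Recall here corresponds to what is often called the hit rate; a wider tolerance raises both precision and recall, so any reported score should state the tolerance used.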

摘要： 语音分割是语音识别和语音合成中必不可少的基础性工作，其质量对后续系统的影响巨大。使用手工分割和标注虽然精度高，但费时费力，同时需要熟练的领域专家来完成，自动语音分割因此成为语音处理的研究热点。首先针对自动语音分割目前的研究进展，介绍了语音分割的不同分类方法；然后分别介绍了基于对齐的方法和基于边界检测的方法，并详细介绍了可以应用在上述两种框架下的神经网络语音分割方法；接着介绍了基于生物激励信号以及博弈论等方法的新型语音分割技术，并给出了领域内广泛使用的性能评估度量，并对这些评估指标进行比较和分析；最后总结并提出语音分割研究未来发展的重要方向。

关键词: 语音分割, 端点检测, 语音合成, 信号特征, 人工神经网络


YANG Jian, LI Zhenpeng, SU Peng. Review of speech segmentation and endpoint detection[J]. Journal of Computer Applications, 2020, 40(1): 1-7.

杨健, 李振鹏, 苏鹏. 语音分割与端点检测研究综述[J]. 计算机应用, 2020, 40(1): 1-7.
