Journal of Computer Applications
Automatic speech segmentation algorithm based on syllable type recognition
Sun Linjia, Qin Lei, Kang Meijin, Wang Yinglin
Abstract: Boundary detection based methods focus on using abrupt changes in the time and frequency domains to segment speech data into syllables, and pay little attention to the role that linguistic knowledge can play in segmentation. Although satisfactory segmentation results can be achieved by tuning various parameters, these methods still suffer from poor stability, difficult parameter adjustment, and weak generalization ability when dealing with large amounts of data and different languages. To address these issues, an automatic speech segmentation algorithm based on syllable type recognition was proposed. The characteristic of this algorithm is that the objects to be recognized are the syllable types in speech data rather than the specific syllable contents. Firstly, syllable types that are largely universal across natural pronunciations in different languages were obtained by using linguistic research findings and syllable composition patterns. Then, an acoustic model was built for each syllable type by using Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs). Moreover, to better describe syllable attributes, a feature extraction channel based on multi-band analysis and salient information fusion was proposed. Finally, based on the recognized syllable type sequence, the Viterbi algorithm was used to determine the speech frames corresponding to the start and end points of syllables. In the experimental phase, the acoustic models of the syllable types were trained on speech data of three common languages, and recognition experiments were then carried out on six languages and dialects, yielding an average recognition accuracy of over 91%. Compared with Mel-Frequency Cepstral Coefficient (MFCC) features, the proposed features increased the average recognition accuracy by at least 28 percentage points. With a tolerance threshold of 20 ms, an average segmentation accuracy of over 90% was still achieved on the six languages and dialects, which is 6 to 13 percentage points higher than that of four representative algorithms from recent years. Experimental results show that the proposed algorithm has stronger generalization ability, better stability, and higher accuracy.
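The abstract only names the feature channel (multi-band analysis plus salient information fusion) without giving its details. The following is a minimal illustrative sketch, not the paper's actual channel: it computes per-frame log energies in a few equal-width frequency bands from an STFT and fuses them with a spectral-flux cue by concatenation; the band count, frame sizes, and the flux-based cue are all assumptions.

# Hedged sketch of a multi-band feature channel (illustrative assumptions only).
import numpy as np
from scipy.signal import stft

def multiband_features(wave, fs=16000, n_bands=6, frame_len=0.025, hop=0.010):
    """Return a (n_frames, n_bands + 1) matrix: band log-energies + spectral flux."""
    nperseg = int(frame_len * fs)
    noverlap = nperseg - int(hop * fs)
    freqs, _, Z = stft(wave, fs=fs, nperseg=nperseg, noverlap=noverlap)
    power = np.abs(Z) ** 2                      # (n_bins, n_frames) power spectrogram

    # Split the spectrum into equal-width bands (a simple stand-in for the
    # paper's multi-band analysis) and take the log energy per band and frame.
    edges = np.linspace(0, len(freqs), n_bands + 1, dtype=int)
    band_energy = np.stack(
        [power[edges[b]:edges[b + 1]].sum(axis=0) for b in range(n_bands)], axis=1
    )
    log_energy = np.log(band_energy + 1e-10)    # (n_frames, n_bands)

    # Spectral flux as a crude salience cue: how much the spectrum changes
    # between consecutive frames (large near syllable transitions).
    flux = np.sqrt(((np.diff(power, axis=1, prepend=power[:, :1])) ** 2).sum(axis=0))
    flux = (flux - flux.mean()) / (flux.std() + 1e-10)

    return np.hstack([log_energy, flux[:, None]])  # fuse by concatenation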
Key words: speech segmentation, syllable type, acoustic model, multi-band analysis, feature fusion
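For the recognition and segmentation stage, the paper trains a GMM-HMM per syllable type and uses Viterbi decoding to find the frames at syllable start and end points. The sketch below is a much-simplified, unsupervised stand-in using hmmlearn: the hidden states of a single ergodic GMM-HMM play the role of syllable types, and a boundary is placed wherever the Viterbi state sequence changes; the number of types, mixture size, and hop length are illustrative assumptions.

# Hedged stand-in for per-type GMM-HMMs + Viterbi segmentation.
import numpy as np
from hmmlearn.hmm import GMMHMM

def segment_by_state_changes(features, n_syllable_types=8, n_mix=4, hop=0.010):
    """Return (start_time, end_time) pairs in seconds, one per decoded segment."""
    model = GMMHMM(n_components=n_syllable_types, n_mix=n_mix,
                   covariance_type="diag", n_iter=25, random_state=0)
    model.fit(features)                                   # features: (n_frames, dim)
    _, states = model.decode(features, algorithm="viterbi")

    # Frames where the Viterbi state changes are taken as segment boundaries.
    change = np.flatnonzero(np.diff(states)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(states)]))
    return [(s * hop, e * hop) for s, e in zip(starts, ends)]

# Example usage (hypothetical waveform array `wave` sampled at 16 kHz):
#   feats = multiband_features(wave)
#   segments = segment_by_state_changes(feats)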
CLC Number: TP391
Sun Linjia, Qin Lei, Kang Meijin, Wang Yinglin. Automatic speech segmentation algorithm based on syllable type recognition[J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2024060748.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024060748