Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (6): 2034-2042. DOI: 10.11772/j.issn.1001-9081.2024060748

• Multimedia computing and computer simulation •

Automatic speech segmentation algorithm based on syllable type recognition

Linjia SUN, Lei QIN, Meijin KANG, Yinglin WANG

  1. Faculty of Linguistic Sciences, Beijing Language and Culture University, Beijing 100083, China
  • Received: 2024-06-06 Revised: 2024-09-25 Accepted: 2024-09-27 Online: 2024-10-12 Published: 2025-06-10
  • Contact: Linjia SUN
  • About author: SUN Linjia, born in 1983, from Shuozhou, Shanxi, Ph. D., assistant research fellow. His research interests include language resource construction and speech processing. sunlinjia@Blcu.edu.cn
    QIN Lei, born in 1999, from Wuhu, Anhui, M. S. Her research interests include language resource construction.
    KANG Meijin, born in 2000, from Suide, Shaanxi, M. S. candidate. Her research interests include language resource construction.
    WANG Yinglin, born in 2001, from Shijiazhuang, Hebei, M. S. candidate. Her research interests include language resource construction.
  • Supported by:
    Protection Project of Language Resources of China (YB2005B004); Fundamental Research Funds for the Central Universities (23YJ170009)

Abstract:

Methods based on boundary detection segment speech data into syllable units mainly by exploiting abrupt changes in the time and frequency domains, and pay little attention to the role that linguistic knowledge plays in segmentation. In addition, these methods usually require various parameters to be set before satisfactory segmentation results can be achieved, which leads to poor stability, difficult parameter adjustment, and weak generalization ability in large-data and cross-language environments. To address these issues, an automatic speech segmentation algorithm based on syllable type recognition was proposed. The distinguishing feature of the proposed algorithm is that it recognizes the syllable types in speech data rather than the specific syllable content. Firstly, syllable types that are relatively common across different languages under natural pronunciation were obtained by using linguistic research findings and syllable composition patterns. Then, an acoustic model for each syllable type was built by using the classic Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM). Moreover, in order to better describe syllable attributes, a feature extraction pipeline based on multi-band analysis and salient information fusion was proposed. Finally, based on the recognized sequences of syllable types, the Viterbi algorithm was used to determine the speech frames corresponding to the start and end points of syllables. In the experimental phase, the acoustic models of syllable types were trained on speech data from three common languages, and recognition experiments were then conducted on six languages and dialects. The experimental results show that the average recognition accuracy of the proposed algorithm is at least 91.93%; compared with using Mel-Frequency Cepstral Coefficients (MFCC), using the proposed features increases the average recognition accuracy by at least 27.16 percentage points; when the tolerance threshold is 20 ms, an average segmentation accuracy of over 90.70% can still be achieved on the six languages and dialects; and compared with four representative algorithms from recent years, the proposed algorithm improves the average segmentation accuracy by at least 5.73 percentage points. These results demonstrate that the proposed algorithm has strong generalization ability, good stability, and high segmentation accuracy.
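
The final step described in the abstract, using Viterbi decoding to locate the frames where syllables start and end, can be illustrated with a small sketch. This is not the authors' implementation: the two-state model ("silence" and "syllable"), the transition and emission probabilities, and the 10 ms frame shift are all assumptions chosen for illustration, whereas the paper uses one GMM-HMM per syllable type. The function names `viterbi` and `segments_from_path` are hypothetical.

```python
import math


def viterbi(log_emit, log_trans, log_init):
    """Return the most likely state path for a sequence of frames.

    log_emit[t][s]  : log-likelihood of frame t under state s
    log_trans[p][s] : log-probability of moving from state p to state s
    log_init[s]     : log-probability of starting in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def segments_from_path(path, frame_shift_ms=10.0):
    """Collapse a frame-level path into (state, start_ms, end_ms) runs."""
    segments, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((path[start], start * frame_shift_ms,
                             t * frame_shift_ms))
            start = t
    return segments


# Toy example (assumed numbers): state 0 = silence, state 1 = syllable.
emit = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9],
        [0.1, 0.9], [0.1, 0.9], [0.9, 0.1]]
trans = [[0.8, 0.2], [0.2, 0.8]]
init = [0.5, 0.5]

log_emit = [[math.log(p) for p in row] for row in emit]
log_trans = [[math.log(p) for p in row] for row in trans]
log_init = [math.log(p) for p in init]

path = viterbi(log_emit, log_trans, log_init)
print(path)                      # [0, 0, 1, 1, 1, 0]
print(segments_from_path(path))  # [(0, 0.0, 20.0), (1, 20.0, 50.0), (0, 50.0, 60.0)]
```

In this toy run the syllable occupies frames 2 to 4, so its start and end points fall at 20 ms and 50 ms under the assumed frame shift.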

Key words: speech segmentation, syllable type, acoustic model, multi-band analysis, feature fusion
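
The abstract reports segmentation accuracy under a 20 ms tolerance threshold. One common way to compute such a figure is to count a reference boundary as correctly detected when some predicted boundary lies within the tolerance of it. The sketch below assumes that definition together with a simple greedy one-to-one matching; the abstract does not state the exact scoring rule, and the function name `segmentation_accuracy` and the boundary times are hypothetical.

```python
def segmentation_accuracy(reference_ms, detected_ms, tolerance_ms=20.0):
    """Fraction of reference boundaries matched by a detected boundary
    within +/- tolerance_ms. Each detected boundary is used at most once
    (greedy matching, an assumption of this sketch)."""
    if not reference_ms:
        return 0.0
    unused = sorted(detected_ms)
    hits = 0
    for ref in sorted(reference_ms):
        # Closest detected boundary that has not been matched yet.
        best = min(unused, key=lambda d: abs(d - ref), default=None)
        if best is not None and abs(best - ref) <= tolerance_ms:
            hits += 1
            unused.remove(best)
    return hits / len(reference_ms)


# Made-up boundary times in milliseconds.
reference = [100, 250, 400, 620]
detected = [95, 262, 430, 615]

print(segmentation_accuracy(reference, detected, tolerance_ms=20.0))  # 0.75
print(segmentation_accuracy(reference, detected, tolerance_ms=30.0))  # 1.0
```

With the 20 ms tolerance, the detected boundary at 430 ms is 30 ms away from its reference and is not counted, which is why the score rises once the tolerance is relaxed to 30 ms.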

CLC number: