Automatic speech segmentation algorithm based on syllable type recognition

doi:10.11772/j.issn.1001-9081.2024060748

Journal of Computer Applications, 2025, Vol. 45, Issue (6): 2034-2042. DOI: 10.11772/j.issn.1001-9081.2024060748

• Multimedia computing and computer simulation •

Linjia SUN, Lei QIN, Meijin KANG, Yinglin WANG

Faculty of Linguistic Sciences, Beijing Language and Culture University, Beijing 100083, China

Received: 2024-06-06 Revised: 2024-09-25 Accepted: 2024-09-27 Online: 2024-10-12 Published: 2025-06-10
Contact: Linjia SUN
About the authors: SUN Linjia, born in 1983, Ph. D., assistant research fellow. His research interests include language resource construction and speech processing.
QIN Lei, born in 1999, M. S. Her research interests include language resource construction.
KANG Meijin, born in 2000, M. S. candidate. Her research interests include language resource construction.
WANG Yinglin, born in 2001, M. S. candidate. Her research interests include language resource construction.
Supported by: Protection Project of Language Resources of China (YB2005B004); Fundamental Research Funds for the Central Universities (23YJ170009)

基于音节类型识别的自动语音分割算法

孙林嘉, 秦磊, 康美金, 王莹琳

北京语言大学语言科学院，北京 100083

通讯作者: 孙林嘉
作者简介:孙林嘉（1983—），男，山西朔州人，助理研究员，博士，主要研究方向：语言资源建设、语音处理 sunlinjia@Blcu.edu.cn
秦磊（1999—），女，安徽芜湖人，硕士，主要研究方向：语言资源建设
康美金（2000—），女，陕西绥德人，硕士研究生，主要研究方向：语言资源建设
王莹琳（2001—），女，河北石家庄人，硕士研究生，主要研究方向：语言资源建设。
基金资助:
中国语言资源保护工程专项(YB2005B004);中央高校基本科研业务费专项(23YJ170009)

Abstract

Boundary-detection methods segment speech data into syllable units by exploiting abrupt changes in the time and frequency domains, and pay little attention to linguistic knowledge. These methods also typically require various parameters to be set before they yield satisfactory segmentation results, so they suffer from poor stability, difficult parameter tuning, and weak generalization in large-scale, cross-language settings. To address these issues, an automatic speech segmentation algorithm based on syllable type recognition is proposed. The defining characteristic of the algorithm is that it recognizes the syllable types in speech data rather than the specific syllable content. First, syllable types that are common across languages under natural pronunciation were derived from linguistic research findings and syllable composition patterns. Second, an acoustic model was built for each syllable type using the classical Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM). In addition, to better describe syllable attributes, a feature extraction pipeline based on multi-band analysis and salient information fusion was proposed. Finally, given the recognized sequence of syllable types, the Viterbi algorithm was used to determine the speech frames corresponding to the start and end points of each syllable. In the experiments, the acoustic models of the syllable types were trained on speech data from three common languages, and recognition experiments were then conducted on six languages and dialects. The experimental results show that the average recognition accuracy is at least 91.93%; compared with Mel-Frequency Cepstral Coefficients (MFCC), the proposed features increase the average recognition accuracy by at least 27.16 percentage points; with a tolerance threshold of 20 ms, an average segmentation accuracy of at least 90.70% is still achieved across the six languages and dialects; and compared with four representative algorithms from recent years, the proposed algorithm improves the average segmentation accuracy by at least 5.73 percentage points. These results demonstrate that the proposed algorithm has stronger generalization ability, better stability, and higher segmentation accuracy.

Key words: speech segmentation, syllable type, acoustic model, multi-band analysis, feature fusion
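
The last step described in the abstract, using the Viterbi algorithm to find the frames where each syllable starts and ends once the syllable type sequence is known, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the states of the recognized syllable types are concatenated into one left-to-right chain, that only self-loops and moves to the next state are allowed, and that per-frame log-likelihoods are already available; the names `viterbi_align`, `loglik`, and `state_to_syllable` are hypothetical.

```python
import math


def viterbi_align(loglik, state_to_syllable):
    """Align frames to a left-to-right state chain and return syllable spans.

    loglik[t][s]: assumed log-likelihood of state s at frame t (hypothetical input).
    state_to_syllable[s]: index of the syllable that state s belongs to.
    Returns (path, spans), where spans holds (syllable, first_frame, last_frame).
    """
    n_frames, n_states = len(loglik), len(loglik[0])
    neg_inf = -math.inf
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = loglik[0][0]  # the path must start in the first state

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            if best > neg_inf:
                score[t][s] = best + loglik[t][s]

    if score[-1][-1] == neg_inf:
        raise ValueError("too few frames to visit every state")

    # Backtrace from the last state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()

    # A syllable boundary lies wherever the syllable index changes.
    spans, start = [], 0
    for t in range(1, n_frames + 1):
        if t == n_frames or state_to_syllable[path[t]] != state_to_syllable[path[t - 1]]:
            spans.append((state_to_syllable[path[t - 1]], start, t - 1))
            start = t
    return path, spans


# Toy input: 6 frames, 3 states; states 0 and 1 form syllable 0, state 2 forms syllable 1.
toy_loglik = [
    [0, -5, -5], [0, -5, -5],
    [-5, 0, -5], [-5, 0, -5],
    [-5, -5, 0], [-5, -5, 0],
]
toy_path, toy_spans = viterbi_align(toy_loglik, [0, 0, 1])
```

Frame indices in `toy_spans` become times once multiplied by the frame shift, which the abstract does not specify.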

摘要：

基于边界检测的方法侧重利用时域和频域的突变来将语音数据切分成音节单元，较少关注语言知识在分割中发挥的作用。同时，此类方法通常需要设置各项参数以获得满意的分割结果，致使这些方法在大数据量和跨语言的环境下存在稳定性差、调整参数难和泛化能力弱的缺点。针对上述问题，提出一种基于音节类型识别的自动语音分割算法。该算法的特点在于所要识别的对象是语音数据中的音节类型，而非具体的音节内容。首先，利用语言学研究成果和音节构成规律获得不同语言在自然发音下较通用的音节类型；其次，采用经典的高斯混合模型（GMM）和隐马尔可夫模型（HMM）为每种音节类型构建声学模型；另外，为了更好地描述音节属性，提出一种基于多频带分析和显著信息融合的特征提取通道；最后，在所识别音节类型序列的基础上，使用维特比算法确定对应音节起止点的语音帧。在实验阶段利用3种常见语言的语音数据训练得到音节类型的声学模型，再在6种语言和方言上进行识别实验。实验结果表明，所提算法的平均识别准确率至少达到了91.93%；与使用梅尔频率倒谱系数（MFCC）相比，使用所提特征获得的平均识别准确率至少提升了27.16个百分点；当容差阈值为20 ms时，在6种语言和方言上依然可以取得90.70%以上的平均分割准确率；相较于近年来有代表性的4种算法，所提算法的平均分割准确率至少提升了5.73个百分点。以上说明所提算法具有较强的泛化能力、较好的稳定性和较高的分割准确率。

关键词: 语音分割, 音节类型, 声学模型, 多频带分析, 特征融合

CLC Number: TP391.42

Linjia SUN, Lei QIN, Meijin KANG, Yinglin WANG. Automatic speech segmentation algorithm based on syllable type recognition[J]. Journal of Computer Applications, 2025, 45(6): 2034-2042.

孙林嘉, 秦磊, 康美金, 王莹琳. 基于音节类型识别的自动语音分割算法[J]. 计算机应用, 2025, 45(6): 2034-2042.


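Language or dialect	Data source	Files	Duration/h	Share for model training/%	Share for speech segmentation/%
English	TIMIT	6 300	5	60	40
French	MediaSpeech	2 498	10	60	40
Spanish	MediaSpeech	2 507	10	0	100
Mandarin Chinese	THCHS	10 000	40	60	40
Jin Chinese dialect	Field survey collection	8 000	11	0	100
Wu Chinese dialect	Field survey collection	8 000	11	0	100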

Syllable type	States for 1st phoneme	States for 2nd phoneme	States for 3rd phoneme	States for 4th phoneme
V	4	0	0	0
CV	5	4	0	0
VC	4	4	0	0
VV	3	3	0	0
CVC	5	4	4	0
CVV	5	3	3	0
VVC	3	3	4	0
VVV	3	3	3	0
CVVC	5	3	3	4
CVVV	5	3	3	3
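
The per-phoneme state counts in the table above determine the size of each syllable-type HMM. The sketch below transcribes those counts, leaving out the zero entries for absent phonemes, and builds a left-to-right transition matrix for one syllable type. The topology, the trailing non-emitting exit state, and the initial self-loop probability of 0.5 are illustrative assumptions and are not stated in the source; `STATE_COUNTS`, `total_states`, and `left_to_right_transitions` are hypothetical names.

```python
# Per-phoneme HMM state counts, transcribed from the table above.
STATE_COUNTS = {
    "V": [4],
    "CV": [5, 4],
    "VC": [4, 4],
    "VV": [3, 3],
    "CVC": [5, 4, 4],
    "CVV": [5, 3, 3],
    "VVC": [3, 3, 4],
    "VVV": [3, 3, 3],
    "CVVC": [5, 3, 3, 4],
    "CVVV": [5, 3, 3, 3],
}


def total_states(syllable_type):
    """Number of emitting states in the model of one syllable type."""
    return sum(STATE_COUNTS[syllable_type])


def left_to_right_transitions(n_states, self_loop=0.5):
    """Build an (n_states + 1) x (n_states + 1) transition matrix.

    Each emitting state may stay where it is or move to the next state.
    The last index is an absorbing, non-emitting exit state (an assumption).
    """
    size = n_states + 1
    matrix = [[0.0] * size for _ in range(size)]
    for s in range(n_states):
        matrix[s][s] = self_loop
        matrix[s][s + 1] = 1.0 - self_loop
    matrix[n_states][n_states] = 1.0
    return matrix


cv_transitions = left_to_right_transitions(total_states("CV"))
```

Under these assumptions the models range from 4 emitting states (V) to 15 (CVVC).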

Language or dialect	Duration 1.5 s	Duration 4.0 s	Duration 7.0 s	With added noise	Using MFCC
English	94.02	93.69	93.28	92.86	70.21
French	94.17	94.06	94.23	92.37	72.35
Mandarin Chinese	93.89	93.17	93.54	92.63	70.36
Mean 1	94.03	93.64	93.68	92.62	70.97
Spanish	89.61	89.75	89.27	87.29	54.89
Wu Chinese dialect	90.98	90.24	90.36	88.08	56.03
Jin Chinese dialect	90.23	90.71	90.92	88.85	55.27
Mean 2	90.27	90.23	90.18	88.07	55.40
Mean 3	92.15	91.94	91.93	90.35	63.19

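Language or dialect	Tolerance 20 ms	Tolerance 30 ms	Tolerance 40 ms	Prior knowledge: syllable count (tolerance 20 ms)	Prior knowledge: type sequence (tolerance 20 ms)
Mean	90.70	91.12	91.55	91.20	91.43
English	92.54	92.96	93.35	93.06	93.13
French	91.71	92.44	93.27	92.56	92.87
Mandarin Chinese	93.03	93.76	94.15	93.47	93.84
Spanish	88.14	88.34	88.77	88.57	88.82
Wu Chinese dialect	89.32	89.51	89.92	89.72	89.96
Jin Chinese dialect	89.48	89.68	89.85	89.81	89.93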
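
The tolerance thresholds in the table above imply a scoring rule in which a detected boundary counts as correct when it falls close enough to a reference boundary. The source does not give the exact definition, so the sketch below shows one common convention as an assumption: each reference boundary may be matched by at most one predicted boundary within the tolerance, and accuracy is the fraction of reference boundaries matched. Times are integer milliseconds, and `boundary_accuracy` is a hypothetical name.

```python
def boundary_accuracy(reference_ms, predicted_ms, tolerance_ms=20):
    """Fraction of reference boundaries matched by a prediction within tolerance.

    Both inputs are boundary times in integer milliseconds. Matching is
    one-to-one and greedy over the sorted lists (an assumed convention).
    """
    reference = sorted(reference_ms)
    predicted = sorted(predicted_ms)
    if not reference:
        return 0.0

    matched = 0
    j = 0
    for ref in reference:
        # Skip predictions too early to match this or any later reference boundary.
        while j < len(predicted) and predicted[j] < ref - tolerance_ms:
            j += 1
        if j < len(predicted) and abs(predicted[j] - ref) <= tolerance_ms:
            matched += 1
            j += 1  # each prediction is used at most once
    return matched / len(reference)


# Made-up boundaries for illustration only; they are not data from the paper.
toy_reference = [120, 350, 610, 900]
toy_predicted = [130, 375, 600, 905]
accuracy_20 = boundary_accuracy(toy_reference, toy_predicted, tolerance_ms=20)
accuracy_30 = boundary_accuracy(toy_reference, toy_predicted, tolerance_ms=30)
```

In this toy case the boundary predicted at 375 ms misses the 350 ms reference at a 20 ms tolerance but matches at 30 ms, mirroring the way accuracy rises with the threshold in the table.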

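Language or dialect	Proposed algorithm	Peak classification algorithm	Oscillation amplitude modulation algorithm	Iterative optimization algorithm	Multi-level segmentation algorithm
Mean	90.70	84.97	83.36	82.45	77.62
English	92.54	88.38	86.03	84.67	72.44
French	91.71	87.65	85.84	83.58	71.86
Mandarin Chinese	93.03	82.27	80.69	81.82	85.46
Spanish	88.14	87.39	85.62	83.28	71.39
Wu Chinese dialect	89.32	81.86	80.72	80.43	81.75
Jin Chinese dialect	89.48	82.26	81.24	80.94	82.81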
