Speech classification model based on improved Inception network

doi:10.11772/j.issn.1001-9081.2022010047

Abstract

To address the cumbersome feature extraction process of traditional audio classification models, as well as the overfitting, low classification accuracy, and vanishing gradient problems of existing neural network models, a speech classification model based on an improved Inception network was proposed. Firstly, to avoid vanishing gradients while increasing the depth of the network, the residual skip connection idea from Residual Network (ResNet) was added to the model to improve the traditional InceptionV2 model. Secondly, the sizes of the convolution kernels in the Inception module were optimized, and deep features of the Log-Mel spectrogram of the original speech were extracted by using convolutions of different sizes, so that the model was able to select the appropriate convolution to process the data through self-learning. At the same time, the model was improved in both depth and width to increase the classification accuracy. Finally, the trained network model was used to classify and predict the speech data, and the classification result was obtained through the Softmax function. Experimental results on the Tsinghua University Chinese speech dataset THCHS-30 and the environmental sound dataset UrbanSound8K show that the classification accuracy of the improved Inception network model on the two datasets is 92.76% and 93.34%, respectively. Compared with models such as Visual Geometry Group (VGG16), InceptionV2, and GoogLeNet, the proposed model achieves the best classification accuracy, with a maximum increase of 27.30 percentage points. The proposed model therefore has stronger feature fusion ability and more accurate classification results, and can solve problems such as overfitting and vanishing gradient.

Key words: speech classification, convolutional neural network, residual skip connection, Log-Mel spectrogram, deep feature
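
For reference, two standard building blocks named in the abstract can be written compactly. The symbols below ($z$, $K$, $F$, $x$, $y$) are illustrative and not taken from the paper; in particular, how the improved Inception module defines the mapping $F$ is not specified in this abstract.

```latex
% Softmax over K class scores z_1, ..., z_K: converts scores to probabilities
p_i = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}, \qquad i = 1, \dots, K

% Residual skip connection as formulated for ResNet: the block output adds
% the block input x to the learned mapping F(x)
y = F(x) + x
```

The identity term $x$ gives gradients a direct path around $F$, which is why skip connections are commonly used to counter vanishing gradients in deeper networks.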

CLC Number:

TN912.3

Qiuyu ZHANG, Yukun WANG. Speech classification model based on improved Inception network[J]. Journal of Computer Applications, 2023, 43(3): 909-915.

Figures/Tables 13

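Network layer	Type	Kernel size	Stride	Depth	Activation function
Input	Input layer	—	—	3	ReLU
Conv 1	Convolutional layer	3×3	1	16	ReLU
Improved Inception	Convolution + pooling	1×1, 1×3, 3×3	2	16	ReLU
Improved Inception	Convolution + pooling	1×1, 1×3, 3×3	1	64	ReLU
Improved Inception	Convolution + pooling	1×1, 1×3, 3×3	2	32	ReLU
Improved Inception	Convolution + pooling	1×1, 1×3, 3×3	1	128	ReLU
Global pooling	Pooling	3×3	1	—	—
Output	Output	—	—	—	Softmax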
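
To make the effect of the strides and depths in the table above concrete, the following minimal sketch traces the feature-map shape through the convolutional layers. It rests on assumptions not stated in the source: "depth" is read as the number of output channels, every layer is assumed to use "same" padding (so only the stride changes the spatial size), and the 128×128 input size is purely hypothetical. The names `Layer`, `LAYERS`, and `trace_shapes` are invented for this illustration.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Layer:
    name: str
    stride: int
    depth: int  # assumed to mean the number of output channels


# Stride and depth values copied from the table; the numeric suffixes on the
# Inception names are added here only to tell the four rows apart.
LAYERS = [
    Layer("Conv 1", 1, 16),
    Layer("Improved Inception 1", 2, 16),
    Layer("Improved Inception 2", 1, 64),
    Layer("Improved Inception 3", 2, 32),
    Layer("Improved Inception 4", 1, 128),
]


def trace_shapes(height, width, channels=3):
    """Return (name, height, width, depth) after each layer.

    Assumes 'same' padding, so each spatial size becomes ceil(size / stride).
    """
    shapes = [("Input", height, width, channels)]
    for layer in LAYERS:
        height = ceil(height / layer.stride)
        width = ceil(width / layer.stride)
        shapes.append((layer.name, height, width, layer.depth))
    return shapes


if __name__ == "__main__":
    # 128 x 128 is a hypothetical input size chosen for the example.
    for name, h, w, d in trace_shapes(128, 128):
        print(f"{name:22s} {h:4d} x {w:4d} x {d}")
```

Under these assumptions, the two stride-2 modules reduce each spatial dimension by a factor of four overall, while the channel count grows from 3 to 128 before global pooling.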

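Parameter	Value	Effect
rotation_range	40	Rotates the data by a random angle between 0 and the specified value
width_shift_range	0.2	Random horizontal shift; the maximum distance is the parameter value times the image width
height_shift_range	0.2	Random vertical shift; the maximum distance is the parameter value times the image height
shear_range	0.2	Shear transformation; keeps the x (or y) coordinate of every point fixed and shifts the y (or x) coordinate in proportion to the parameter value
zoom_range	0.2	Scales the image along the width and height directions according to the parameter value
horizontal_flip	True	Randomly flips images horizontally
fill_mode	nearest	Fills the data after shift, zoom, and shear operations using the default method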
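
The parameter names in the table above coincide with keyword arguments of Keras's `ImageDataGenerator`, so the settings could plausibly be applied as in the sketch below. The source does not state which library was used, so treating this as a Keras configuration is an assumption; `AUGMENTATION_PARAMS`, `build_augmenter`, and `max_shift_pixels` are hypothetical names introduced for illustration.

```python
# Values copied from the table above.
AUGMENTATION_PARAMS = {
    "rotation_range": 40,
    "width_shift_range": 0.2,
    "height_shift_range": 0.2,
    "shear_range": 0.2,
    "zoom_range": 0.2,
    "horizontal_flip": True,
    "fill_mode": "nearest",
}


def build_augmenter(params=AUGMENTATION_PARAMS):
    """Create a Keras ImageDataGenerator from the table values.

    Assumes TensorFlow is installed; the import is deferred so the rest of
    this module works without it. Recent Keras releases mark this class
    as deprecated.
    """
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    return ImageDataGenerator(**params)


def max_shift_pixels(width, height, params=AUGMENTATION_PARAMS):
    """Largest horizontal and vertical shift in pixels for a given image size."""
    return (
        params["width_shift_range"] * width,
        params["height_shift_range"] * height,
    )
```

For a hypothetical 128×64 spectrogram image, `max_shift_pixels(128, 64)` gives a maximum shift of roughly 25.6 pixels horizontally and 12.8 pixels vertically, matching the "parameter value times image size" rule in the table.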

Iterations	Accuracy/%	Iterations	Accuracy/%
15	83.24	35	89.92
20	85.00	40	92.55
25	88.30	45	93.03
30	87.20	50	93.48