Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (3): 909-915. DOI: 10.11772/j.issn.1001-9081.2022010047

• Multimedia Computing and Computer Simulation •


Speech classification model based on improved Inception network

Qiuyu ZHANG, Yukun WANG

  1. School of Computer and Communication, Lanzhou University of Technology, Lanzhou, Gansu 730050, China
  • Received:2022-01-17 Revised:2022-06-08 Accepted:2022-06-10 Online:2022-07-11 Published:2023-03-10
  • Contact: Yukun WANG
  • About author: ZHANG Qiuyu, born in 1966 in Xinji, Hebei, research fellow. His research interests include network and information security, intelligent information processing, and pattern recognition. WANG Yukun, born in 1996 in Lanzhou, Gansu, master's candidate. His research interests include network and information security, and machine learning.
  • Supported by:
    National Natural Science Foundation of China(61862041)


Abstract:

Aiming at the complicated process of extracting audio features in traditional audio classification models, as well as problems of existing neural network models such as overfitting, low classification accuracy, and vanishing gradients, a speech classification model based on an improved Inception network was proposed. Firstly, the residual skip connection idea of Residual Network (ResNet) was introduced into the model to improve the traditional InceptionV2 model, so that vanishing gradients were avoided while the network was deepened. Secondly, the sizes of the convolution kernels in the Inception module were optimized, and deep features were extracted from the Log-Mel spectrogram of the original speech by convolutions of different sizes, so that the model learned to select the appropriate convolutions to process the data. At the same time, the model was improved in both the depth and width dimensions to increase classification accuracy. Finally, the trained network model was used to classify the speech data, and the classification result was obtained through the Softmax function. Experimental results on the Tsinghua University Chinese speech dataset THCHS-30 and the environmental sound dataset UrbanSound8K show that the classification accuracy of the improved Inception network model on the two datasets is 92.76% and 93.34% respectively. Compared with models such as VGG16, InceptionV2, and GoogLeNet, the proposed model achieves the best classification accuracy, with an improvement of up to 27.30 percentage points. It can be seen that the proposed model has stronger feature fusion ability, produces more accurate classification results, and can alleviate problems such as overfitting and vanishing gradients.
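The core improvement described in the abstract, ResNet-style skip connections grafted onto multi-branch Inception convolutions, can be sketched in PyTorch as follows. The branch layout and channel sizes are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class ResidualInceptionBlock(nn.Module):
    """Inception-style block with a ResNet skip connection (illustrative sketch).

    Parallel branches with different kernel sizes extract multi-scale features
    from the input feature map; the block input is added back before the
    activation (residual skip connection) so gradients can flow through deep
    stacks without vanishing.
    """
    def __init__(self, channels):
        super().__init__()
        b = channels // 4  # each branch contributes a quarter of the output channels
        self.branch1 = nn.Conv2d(channels, b, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, b, kernel_size=1),
            nn.Conv2d(b, b, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, b, kernel_size=1),
            nn.Conv2d(b, b, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(channels, b, kernel_size=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # concatenate multi-scale branches along the channel dimension
        out = torch.cat([self.branch1(x), self.branch3(x),
                         self.branch5(x), self.branch_pool(x)], dim=1)
        return self.relu(out + x)  # residual skip connection

# Example: a batch of 8 feature maps produced upstream from Log-Mel spectrograms
block = ResidualInceptionBlock(64)
y = block(torch.randn(8, 64, 40, 100))
print(y.shape)  # torch.Size([8, 64, 40, 100]) -- spatial size and channels preserved
```

Because every branch preserves the spatial size and the four branches together restore the input channel count, the element-wise addition with the input needs no projection layer in this sketch.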

Key words: speech classification, convolutional neural network, residual skip connection, Log-Mel spectrogram, deep feature

CLC number: