Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (5): 1636-1643. DOI: 10.11772/j.issn.1001-9081.2023050663

• Multimedia Computing and Computer Simulation •

Classroom speech emotion recognition method based on multi-scale temporal-aware network

Juxiang ZHOU1,2, Jinsheng LIU1(), Jianhou GAN1,2, Di WU1, Zijie LI1   

  1. Key Laboratory of Educational Informatization for Nationalities (Yunnan Normal University), Ministry of Education, Kunming Yunnan 650500, China
    2. Yunnan Key Laboratory of Smart Education (Yunnan Normal University), Kunming Yunnan 650500, China
  • Received: 2023-05-29 Revised: 2023-08-16 Accepted: 2023-09-12 Online: 2023-09-19 Published: 2024-05-10
  • Corresponding author: Jinsheng LIU
  • About the authors: ZHOU Juxiang, born in 1986 in Lantian, Shaanxi, Ph.D., associate professor, CCF member. Her research interests include smart education, computer vision, and speech recognition.
    GAN Jianhou, born in 1976 in Fengqing, Yunnan, Ph.D., professor, CCF member. His research interests include smart education and knowledge graphs.
    WU Di, born in 1995 in Shangqiu, Henan, Ph.D. candidate, CCF member. His research interests include smart education and computer vision.
    LI Zijie, born in 1998 in Kunming, Yunnan, Ph.D. candidate, CCF member. His research interests include smart education and natural language processing.
    LIU Jinsheng (corresponding author), born in 1998 in Weifang, Shandong, M.S. candidate. His research interests include smart education and speech recognition.
  • Supported by:
    National Natural Science Foundation of China (62107034); Yunnan Provincial Science and Technology Program (202101AT070095); Project of Yunnan International Joint R&D Center of China-Laos-Thailand Educational Digitalization (202203AP140006)


Abstract:

Speech emotion recognition has been widely applied in multi-scenario intelligent systems in recent years, and it also makes intelligent analysis of teaching behaviors possible in smart classroom environments. Classroom speech emotion recognition technology can automatically recognize the emotional states of teachers and students during classroom teaching, helping teachers understand their own teaching styles and grasp students' learning status in time, thereby enabling precise teaching. For the classroom speech emotion recognition task, firstly, classroom teaching videos were collected from primary and secondary schools, and the audio was extracted, manually segmented, and annotated to construct a primary and secondary school teaching speech emotion corpus covering six emotion categories. Secondly, dual temporal convolution channels were designed on the basis of the Temporal Convolutional Network (TCN) and a cross-gated mechanism to extract multi-scale cross-fused features. Finally, a dynamic weight fusion strategy was adopted to adjust the contribution of features at different scales, reduce the interference of unimportant features with the recognition results, and further enhance the representation and learning ability of the model. Experimental results show that the proposed method outperforms advanced models such as TIM-Net (Temporal-aware bI-direction Multi-scale Network), GM-TCNet (Gated Multi-scale Temporal Convolutional Network), and CTL-MTNet (CapsNet and Transfer Learning-based Mixed Task Net) on multiple public datasets, and reaches an Unweighted Average Recall (UAR) of 90.58% and a Weighted Average Recall (WAR) of 90.45% on the real classroom speech emotion recognition task.

Key words: speech emotion recognition, classroom speech, temporal convolutional network, cross-gated convolution, Mel-Frequency Cepstral Coefficient (MFCC)
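As a rough illustration of the dual-channel cross-gated temporal convolution described in the abstract, the sketch below implements one dual-path step in NumPy. The paper's exact formulation is not given here, so the gating form is an assumption: each path's causal dilated convolution is modulated by a sigmoid gate computed from the other path, a common cross-gating pattern; `cross_gated_block` and its weight shapes are hypothetical names for illustration.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """Causal dilated 1-D convolution over a (T, C_in) sequence.
    w has shape (K, C_in, C_out); the input is left-padded with zeros
    so the output keeps length T and out[t] depends only on x[:t+1]."""
    K, C_in, C_out = w.shape
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros((pad, C_in)), x], axis=0)
    T = x.shape[0]
    out = np.zeros((T, C_out))
    for t in range(T):
        for k in range(K):
            # tap k looks back k * dilation frames
            out[t] += xp[t + pad - k * dilation] @ w[k]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_gated_block(x, w_a, w_b, dilation):
    """One dual-path step: each path's tanh activation is gated by a
    sigmoid of the OTHER path's convolution output (cross-gating)."""
    a = causal_dilated_conv(x, w_a, dilation)
    b = causal_dilated_conv(x, w_b, dilation)
    return np.tanh(a) * sigmoid(b), np.tanh(b) * sigmoid(a)
```

Stacking such blocks with growing dilations (1, 2, 4, ...) yields the exponentially growing receptive field that lets a TCN capture multi-scale temporal context.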

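The dynamic weight fusion strategy mentioned in the abstract can be sketched as softmax-normalized scalar weights over the per-scale features, so that scales the model deems unimportant contribute less to the fused representation. This is a hypothetical formulation for illustration (the paper's exact weighting scheme is not reproduced here); in a trained model the `logits` would be learnable parameters.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_weight_fusion(features, logits):
    """Fuse multi-scale features with softmax-normalized weights.
    features: list of S feature vectors, each of shape (D,)
    logits:   shape (S,), one learnable score per scale."""
    w = softmax(logits)
    return sum(wi * f for wi, f in zip(w, features))
```

Because the weights sum to 1, the fusion is a convex combination: a dominant logit lets one scale drive the output, while equal logits reduce to plain averaging.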
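The two metrics reported in the abstract have standard definitions: UAR is the mean of per-class recalls (insensitive to class imbalance), while WAR weights each class recall by its frequency, which works out to overall accuracy. A minimal stdlib-only sketch:

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall for each class appearing in y_true."""
    hits, totals = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return {c: hits[c] / totals[c] for c in totals}

def uar(y_true, y_pred):
    """Unweighted Average Recall: mean of per-class recalls."""
    r = per_class_recall(y_true, y_pred)
    return sum(r.values()) / len(r)

def war(y_true, y_pred):
    """Weighted Average Recall: per-class recalls weighted by class
    frequency, which equals plain accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

On an imbalanced test set such as a real classroom corpus, a gap between UAR and WAR indicates that rare emotion classes are recognized worse than frequent ones.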

CLC number: