Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (5): 1636-1643. DOI: 10.11772/j.issn.1001-9081.2023050663

• Multimedia Computing and Computer Simulation •

Classroom speech emotion recognition method based on multi-scale temporal-aware network

Juxiang ZHOU1,2, Jinsheng LIU1(), Jianhou GAN1,2, Di WU1, Zijie LI1   

  1. Key Laboratory of Educational Informatization for Nationalities (Yunnan Normal University), Ministry of Education, Kunming Yunnan 650500, China
    2. Yunnan Key Laboratory of Smart Education (Yunnan Normal University), Kunming Yunnan 650500, China
  • Received: 2023-05-29 Revised: 2023-08-16 Accepted: 2023-09-12 Online: 2023-09-19 Published: 2024-05-10
  • Corresponding author: Jinsheng LIU
  • About the authors: ZHOU Juxiang, born in 1986 in Lantian, Shaanxi, Ph.D., associate professor, CCF member. Her research interests include smart education, computer vision, and speech recognition.
    GAN Jianhou, born in 1976 in Fengqing, Yunnan, Ph.D., professor, CCF member. His research interests include smart education and knowledge graphs.
    WU Di, born in 1995 in Shangqiu, Henan, Ph.D. candidate, CCF member. His research interests include smart education and computer vision.
    LI Zijie, born in 1998 in Kunming, Yunnan, Ph.D. candidate, CCF member. His research interests include smart education and natural language processing.
    LIU Jinsheng (corresponding author), born in 1998 in Weifang, Shandong, M.S. candidate. His research interests include smart education and speech recognition.
  • Supported by:
    National Natural Science Foundation of China (62107034); Yunnan Provincial Science and Technology Program (202101AT070095); Project of Yunnan International Joint R&D Center of China-Laos-Thailand Educational Digitalization (202203AP140006)


Abstract:

Speech emotion recognition has been widely applied in multi-scenario intelligent systems in recent years, and it also makes intelligent analysis of teaching behaviors possible in smart classroom environments. Classroom speech emotion recognition technology can automatically recognize the emotional states of teachers and students during classroom teaching, helping teachers understand their own teaching styles and grasp students' learning status in time, thereby enabling precise teaching. For the classroom speech emotion recognition task, firstly, classroom teaching videos were collected from primary and secondary schools, and the audio was extracted, manually segmented, and annotated to construct a primary and secondary school teaching speech emotion corpus covering six emotion categories. Secondly, dual temporal convolution channels were designed on the basis of the Temporal Convolutional Network (TCN) and a cross-gated mechanism to extract multi-scale cross-fused features. Finally, a dynamic weight fusion strategy was adopted to adjust the contribution of features at different scales, reduce the interference of unimportant features with the recognition results, and further enhance the representation and learning ability of the model. Experimental results show that the proposed method outperforms advanced models such as TIM-Net (Temporal-aware bI-direction Multi-scale Network), GM-TCNet (Gated Multi-scale Temporal Convolutional Network), and CTL-MTNet (CapsNet and Transfer Learning-based Mixed Task Net) on multiple public datasets, and reaches an Unweighted Average Recall (UAR) of 90.58% and a Weighted Average Recall (WAR) of 90.45% on the real classroom speech emotion recognition task.

Key words: speech emotion recognition, classroom speech, temporal convolutional network, cross-gated convolution, Mel-Frequency Cepstral Coefficient (MFCC)
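As a rough illustration of the dual-channel cross-gated temporal convolution described in the abstract, the sketch below implements one dual-path step in NumPy. The paper's exact formulation is not given here, so the gating form is an assumption: each path's causal dilated convolution is modulated by a sigmoid gate computed from the other path, a common cross-gating pattern; `cross_gated_block` and its weight shapes are hypothetical names for illustration.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """Causal dilated 1-D convolution over a (T, C_in) sequence.
    w has shape (K, C_in, C_out); the input is left-padded with zeros
    so the output keeps length T and out[t] depends only on x[:t+1]."""
    K, C_in, C_out = w.shape
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros((pad, C_in)), x], axis=0)
    T = x.shape[0]
    out = np.zeros((T, C_out))
    for t in range(T):
        for k in range(K):
            # tap k looks back k * dilation frames
            out[t] += xp[t + pad - k * dilation] @ w[k]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_gated_block(x, w_a, w_b, dilation):
    """One dual-path step: each path's tanh activation is gated by a
    sigmoid of the OTHER path's convolution output (cross-gating)."""
    a = causal_dilated_conv(x, w_a, dilation)
    b = causal_dilated_conv(x, w_b, dilation)
    return np.tanh(a) * sigmoid(b), np.tanh(b) * sigmoid(a)
```

Stacking such blocks with growing dilations (1, 2, 4, ...) yields the exponentially growing receptive field that lets a TCN capture multi-scale temporal context.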

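The dynamic weight fusion strategy mentioned in the abstract can be sketched as softmax-normalized scalar weights over the per-scale features, so that scales the model deems unimportant contribute less to the fused representation. This is a hypothetical formulation for illustration (the paper's exact weighting scheme is not reproduced here); in a trained model the `logits` would be learnable parameters.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def dynamic_weight_fusion(features, logits):
    """Fuse multi-scale features with softmax-normalized weights.
    features: list of S feature vectors, each of shape (D,)
    logits:   shape (S,), one learnable score per scale."""
    w = softmax(logits)
    return sum(wi * f for wi, f in zip(w, features))
```

Because the weights sum to 1, the fusion is a convex combination: a dominant logit lets one scale drive the output, while equal logits reduce to plain averaging.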
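The two metrics reported in the abstract have standard definitions: UAR is the mean of per-class recalls (insensitive to class imbalance), while WAR weights each class recall by its frequency, which works out to overall accuracy. A minimal stdlib-only sketch:

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall for each class appearing in y_true."""
    hits, totals = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return {c: hits[c] / totals[c] for c in totals}

def uar(y_true, y_pred):
    """Unweighted Average Recall: mean of per-class recalls."""
    r = per_class_recall(y_true, y_pred)
    return sum(r.values()) / len(r)

def war(y_true, y_pred):
    """Weighted Average Recall: per-class recalls weighted by class
    frequency, which equals plain accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

On an imbalanced test set such as a real classroom corpus, a gap between UAR and WAR indicates that rare emotion classes are recognized worse than frequent ones.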

CLC number: