Speech emotion recognition with global-aware fusion on binary multi-scale features

doi:10.11772/j.issn.1001-9081.2025080935

Abstract

Abstract: Traditional Convolutional Neural Networks (CNNs) have limited receptive fields and therefore struggle to capture global emotional information in Speech Emotion Recognition (SER), while global self-attention mechanisms incur excessive computational overhead on resource-constrained devices such as mobile terminals. To obtain a lightweight, high-accuracy SER model suitable for edge deployment, a Global-Aware Fusion neural network with Binary Multi-Scale Feature Representation (GAF-BMFR) was proposed. The network used a Binary Neural Network (BNN) as its backbone to reduce model complexity. First, a Binary Multi-Scale Block (BMSB) was designed, in which parallel 3×3 and 5×5 binary convolutions simultaneously extracted fine-grained and wide-field features in the time-frequency domain. Second, a lightweight Global-Aware Block (GAB) was introduced, which used fully connected layers and 1×1 convolutions in place of costly self-attention to model global correlations and adaptively select cross-scale features. Third, a combined loss function, Class-Balanced Focal-Logarithmic Margin-Multi-Scale Regularization (CB-FLM-MSR), was designed to alleviate class imbalance and ambiguous emotional boundaries. Experimental results on public datasets showed that GAF-BMFR achieved superior performance with low resource consumption. On the scripted portion of the IEMOCAP dataset, compared with the typical APCNN model, GAF-BMFR improved Weighted Accuracy (WA) by 25.23 percentage points and Unweighted Accuracy (UA) by 24.18 percentage points. Meanwhile, the model had 686 K parameters, approximately 24.04% fewer than the 904 K parameters of its full-precision version. GAF-BMFR thus improved the accuracy and robustness of SER while reducing the parameter count and computational overhead, providing a feasible lightweight solution for deploying affective computing applications in resource-constrained scenarios.
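
The two core ideas in the abstract, parallel 3×3 and 5×5 binary convolutions followed by a lightweight global gate that selects between scales, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the single-channel input, the zero padding, the 2×2 gate matrix, and the softmax over scales are all assumptions made for illustration, and the 1×1 convolutions of the GAB are omitted.

```python
import math
import random


def sign(x):
    """Binarize one value to +1/-1 (zero maps to +1), a common BNN convention."""
    return 1.0 if x >= 0 else -1.0


def binarize(matrix):
    """Apply the sign function element-wise to a 2-D list."""
    return [[sign(v) for v in row] for row in matrix]


def binary_conv2d(feature, kernel):
    """'Same'-size 2-D cross-correlation with binarized input and kernel.

    Positions outside the feature map contribute zero (assumed zero padding).
    """
    k = len(kernel)
    pad = k // 2
    h, w = len(feature), len(feature[0])
    x, wgt = binarize(feature), binarize(kernel)
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    r, c = i + di - pad, j + dj - pad
                    if 0 <= r < h and 0 <= c < w:
                        acc += x[r][c] * wgt[di][dj]
            out[i][j] = acc
    return out


def global_scale_weights(branches, gate):
    """Hypothetical stand-in for the GAB: global average pooling per branch,
    one small fully connected layer (gate), then a softmax over the scales."""
    pooled = [sum(map(sum, b)) / (len(b) * len(b[0])) for b in branches]
    logits = [sum(g * p for g, p in zip(row, pooled)) for row in gate]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_scale_block(feature, k3, k5, gate):
    """Run the 3x3 and 5x5 binary branches in parallel and fuse them
    with globally computed per-scale weights."""
    branches = [binary_conv2d(feature, k3), binary_conv2d(feature, k5)]
    weights = global_scale_weights(branches, gate)
    h, w = len(feature), len(feature[0])
    fused = [[sum(wt * b[i][j] for wt, b in zip(weights, branches))
              for j in range(w)] for i in range(h)]
    return fused, weights


# Toy usage on a random 6x8 "spectrogram" patch (illustrative values only).
rng = random.Random(0)
spec = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(6)]
k3 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
k5 = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(5)]
gate = [[0.5, -0.5], [-0.5, 0.5]]
fused, weights = multi_scale_block(spec, k3, k5, gate)
```

Because both the activations and the kernels are restricted to ±1, each multiply-accumulate in a real BNN can be replaced by XNOR and popcount operations, which is where the savings on edge devices come from. The gate here costs only a handful of multiplications per block, in contrast to self-attention, whose cost grows quadratically with the number of time-frequency positions.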

Key words: speech emotion recognition, binarization, multi-scale features, global-aware block, Global-Aware Fusion neural network with Binary Multi-Scale Feature Representation (GAF-BMFR)

CLC Number: TP391.42

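Zheng WANG, Jiahui NI, Fanlong ZHANG. Speech emotion recognition with global-aware fusion on binary multi-scale features [J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2025080935.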
