Journal of Computer Applications
王政,倪佳慧,张凡龙
Abstract: Traditional Convolutional Neural Networks (CNNs) struggle to capture global emotional information in Speech Emotion Recognition (SER) because of their limited receptive fields, while global self-attention mechanisms incur excessive computational overhead on resource-constrained devices such as mobile terminals. To obtain a lightweight, high-accuracy SER model suitable for edge deployment, a Global-Aware Fusion neural network with Binary Multi-Scale Feature Representation (GAF-BMFR) was proposed. The network used a Binary Neural Network (BNN) as its backbone to reduce model complexity. First, a Binary Multi-Scale Block (BMSB) was designed, in which parallel 3×3 and 5×5 binary convolutions simultaneously extracted fine-grained and wide-field features in the time-frequency domain. Second, a lightweight Global-Aware Block (GAB) was introduced, which replaced the costly self-attention mechanism with fully connected layers and 1×1 convolutions to model global correlations and adaptively select cross-scale features. Third, a combined loss function, Class-Balanced Focal-Logarithmic Margin-Multi-Scale Regularization (CB-FLM-MSR), was designed to alleviate problems such as class imbalance and ambiguous emotional boundaries. Experimental results on public datasets showed that GAF-BMFR achieved superior performance with low resource consumption. For example, on the scripted portion of the IEMOCAP dataset, GAF-BMFR improved Weighted Accuracy (WA) by 25.23 percentage points and Unweighted Accuracy (UA) by 24.18 percentage points over the typical APCNN model. Meanwhile, the model had 686 K parameters, approximately 24.04% fewer than the 904 K parameters of its full-precision version.
GAF-BMFR thus improved the accuracy and robustness of speech emotion recognition while significantly reducing the parameter count and computational overhead, providing a feasible lightweight solution for deploying efficient affective computing applications in resource-constrained scenarios.
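The multi-scale idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the sign-based binarization, the zero padding, the single-channel input, and the pooling-plus-softmax gate are all assumptions chosen for illustration, and the names `binary_conv2d`, `global_gate`, and `multi_scale_block` are hypothetical. The sketch only shows how parallel 3×3 and 5×5 binary convolutions could be computed on a time-frequency patch and then reweighted by a cheap global descriptor instead of self-attention.

```python
import math
import random


def sign(x):
    """Binarize a value to +1 or -1 (zero maps to +1; an assumed convention)."""
    return 1.0 if x >= 0 else -1.0


def binarize(matrix):
    """Apply sign binarization element-wise to a 2D list."""
    return [[sign(v) for v in row] for row in matrix]


def binary_conv2d(feature, kernel):
    """'Same'-size 2D cross-correlation with binarized inputs and weights."""
    k = len(kernel)
    pad = k // 2
    rows, cols = len(feature), len(feature[0])
    x, w = binarize(feature), binarize(kernel)
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    r, c = i + di - pad, j + dj - pad
                    # Positions outside the map count as zero padding.
                    if 0 <= r < rows and 0 <= c < cols:
                        acc += x[r][c] * w[di][dj]
            out[i][j] = acc
    return out


def global_gate(branches, fc_weights):
    """Assumed gate: global average pooling, one FC layer, softmax over branches."""
    pooled = [sum(map(sum, b)) / (len(b) * len(b[0])) for b in branches]
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in fc_weights]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_scale_block(feature, k3, k5, fc_weights):
    """Run 3x3 and 5x5 binary branches in parallel and fuse them with global gates."""
    branches = [binary_conv2d(feature, k3), binary_conv2d(feature, k5)]
    gates = global_gate(branches, fc_weights)
    rows, cols = len(feature), len(feature[0])
    # The per-position weighted sum over branches acts like a 1x1 convolution
    # whose weights depend on the global descriptor.
    fused = [
        [sum(g * b[i][j] for g, b in zip(gates, branches)) for j in range(cols)]
        for i in range(rows)
    ]
    return fused, gates


# Toy usage: a random 4x6 patch standing in for a spectrogram region.
random.seed(0)
patch = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(4)]
k3 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
k5 = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(5)]
fc = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical FC weights (identity)
fused, gates = multi_scale_block(patch, k3, k5, fc)
```

In this sketch the gate costs one pooling pass and a 2×2 matrix product, which hints at why a pooled global descriptor is far cheaper than pairwise self-attention over all time-frequency positions. A real binary network would also need multi-channel features and a gradient approximation for the sign function during training, both omitted here.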
Key words: speech emotion recognition, binarization, multi-scale features, global-aware block, Global-Aware Fusion neural network with Binary Multi-Scale Feature Representation (GAF-BMFR)
CLC Number: TP391.42
王政, 倪佳慧, 张凡龙. Speech emotion recognition with global-aware fusion of binary multi-scale features [J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2025080935.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2025080935