Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 86-93. DOI: 10.11772/j.issn.1001-9081.2023060753
Mu LI, Yuheng YANG, Xizheng KE

Received: 2023-06-15
Revised: 2023-08-14
Accepted: 2023-08-21
Online: 2023-09-25
Published: 2024-01-10
Contact: Yuheng YANG
About author: LI Mu, born in 1972 in Xi'an, Shaanxi, M.S., senior engineer. His research interests include vital sign detection and deep learning.
Abstract: To effectively mine unimodal representation information in multimodal sentiment analysis and fully fuse multimodal information, an emotion recognition model based on hybrid features and cross-modal prediction fusion (H-MGFCT) was proposed. First, a hybrid feature parameter extraction algorithm (H-MGFCC) was built by fusing Mel-Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), and their first-order dynamic features, addressing the loss of speech emotion features. Second, a cross-modal prediction model based on attention weights was used to screen out the text features most correlated with the speech features. Then, a cross-modal attention mechanism model incorporating contrastive learning fused the highly correlated text features with the speech-modality emotion features. Finally, the resulting text-speech cross-modal features were fused with the screened-out low-correlation text features to supplement the information. Experimental results show that, on the public IEMOCAP (Interactive EMotional dyadic MOtion CAPture), CMU-MOSI (CMU Multimodal Opinion-level Sentiment Intensity), and CMU-MOSEI (CMU Multimodal Opinion Sentiment and Emotion Intensity) datasets, the proposed model improves accuracy over the speech-text emotion recognition model with weighted Decision-Level Fusion (DLFT) by 2.83, 2.64, and 3.05 percentage points respectively, verifying the effectiveness of the model for emotion recognition.
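As a concrete illustration of the hybrid feature step, the sketch below concatenates MFCC and GFCC frames with their first-order deltas into one feature matrix. This is a minimal reading of the abstract, not the authors' released code: it assumes librosa for MFCC/delta extraction and the third-party spafe package for GFCC, and the sampling rate and coefficient count are placeholder choices.

```python
# Minimal H-MGFCC-style hybrid feature sketch (an assumption-laden reading of
# the abstract, not the authors' implementation).
import numpy as np
import librosa
from spafe.features.gfcc import gfcc  # third-party GFCC implementation (assumed)

def extract_hybrid_features(wav_path: str, sr: int = 16000,
                            n_ceps: int = 13) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)
    # MFCC and its first-order dynamic (delta) features: (n_ceps, T)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_ceps)
    d_mfcc = librosa.feature.delta(mfcc)
    # spafe returns GFCC as (frames, n_ceps); transpose to match librosa's layout.
    g = np.asarray(gfcc(y, fs=sr, num_ceps=n_ceps)).T
    d_gfcc = librosa.feature.delta(g)
    # The two front ends use different framing defaults, so trim to the
    # shorter sequence before fusing along the feature axis.
    t = min(mfcc.shape[1], g.shape[1])
    return np.concatenate(
        [mfcc[:, :t], d_mfcc[:, :t], g[:, :t], d_gfcc[:, :t]], axis=0
    )  # hybrid feature matrix: (4 * n_ceps, t)
```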
Mu LI, Yuheng YANG, Xizheng KE. Emotion recognition model based on hybrid-mel gama frequency cross-attention transformer modal[J]. Journal of Computer Applications, 2024, 44(1): 86-93.
Tab. 1 Dataset sample sizes

| Dataset | Training set | Validation set | Test set | Total |
| --- | --- | --- | --- | --- |
| CMU-MOSI | 1 453 | 232 | 411 | 2 096 |
| CMU-MOSEI | 16 853 | 2 103 | 2 597 | 21 553 |
| IEMOCAP | 6 711 | 634 | 1 746 | 9 091 |
Tab. 2 Comparison of emotion fusion effects of different models

| Model | IEMOCAP | | | | CMU-MOSI | | | | CMU-MOSEI | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Acc/% | F1/% | MAE | Corr | Acc/% | F1/% | MAE | Corr | Acc/% | F1/% | MAE | Corr |
| GGRU | 71.80 | 62.83 | 0.894 | 0.695 | 73.42 | 61.31 | 0.797 | 0.696 | 72.93 | 64.35 | 0.762 | 0.701 |
| LLA | 69.74 | 57.97 | 0.923 | 0.698 | 68.73 | 59.49 | 0.832 | 0.701 | 71.31 | 61.99 | 0.816 | 0.706 |
| LFC | 75.49 | 63.62 | 0.793 | 0.745 | 71.29 | 65.34 | 0.743 | 0.743 | 73.84 | 67.68 | 0.663 | 0.747 |
| FLFT | 74.27 | 67.45 | 0.787 | 0.748 | 77.83 | 74.33 | 0.693 | 0.752 | 79.33 | 69.73 | 0.593 | 0.753 |
| DLFT | 77.18 | 71.26 | 0.768 | 0.755 | 79.32 | 73.89 | 0.687 | 0.757 | 78.37 | 74.33 | 0.588 | 0.759 |
| Proposed model | 80.01 | 69.73 | 0.759 | 0.763 | 81.96 | 74.33 | 0.676 | 0.768 | 81.42 | 72.46 | 0.594 | 0.765 |
Tab. 3 Effectiveness validation of H-MGFCC, CSA-Transformer, contrastive learning, and the encoder prediction model based on attention weight indexing

| No. | Model | CMU-MOSI | | | | CMU-MOSEI | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | Acc/% | F1/% | MAE | Corr | Acc/% | F1/% | MAE | Corr |
| 1 | Proposed model | 81.96 | 74.33 | 0.676 | 0.768 | 81.42 | 72.46 | 0.594 | 0.765 |
| 2 | H-MGFCT/H | 72.39 | 67.16 | 0.796 | 0.652 | 73.42 | 69.67 | 0.827 | 0.674 |
| 3 | H-MGFCT/C | 74.64 | 71.37 | 0.821 | 0.691 | 75.79 | 73.54 | 0.787 | 0.621 |
| 4 | H-MGFCT/S | 75.41 | 72.03 | 0.784 | 0.667 | 76.76 | 71.98 | 0.769 | 0.659 |
| 5 | H-MGFCT/T | 77.41 | 72.56 | 0.794 | 0.711 | 77.28 | 72.73 | 0.754 | 0.761 |
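The ablations above remove, in turn, the components the abstract describes: the hybrid features, the cross-modal attention Transformer, and the contrastive objective. As a hedged illustration of the latter two, the sketch below pairs a generic cross-modal attention layer (text queries attending to speech keys/values) with an InfoNCE-style contrastive loss over paired utterance embeddings. Layer sizes, the temperature, and all names are placeholder assumptions, not the paper's architecture.

```python
# Generic cross-modal attention + contrastive loss sketch (illustrative only;
# not the H-MGFCT architecture, whose exact configuration is not given here).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, speech: torch.Tensor) -> torch.Tensor:
        # text: (B, Lt, d), speech: (B, Ls, d); text queries attend to speech.
        fused, _ = self.attn(query=text, key=speech, value=speech)
        return self.norm(text + fused)  # residual connection around the fusion

def info_nce(text_emb: torch.Tensor, speech_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    # Utterance-level embeddings: (B, d). Matched text-speech pairs are the
    # positives; every other pairing in the batch serves as a negative.
    t = F.normalize(text_emb, dim=-1)
    s = F.normalize(speech_emb, dim=-1)
    logits = t @ s.T / temperature                    # (B, B) similarities
    targets = torch.arange(t.size(0), device=t.device)
    return F.cross_entropy(logits, targets)
```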
Tab. 4 Overall performance comparison of different models

| Model | Parameters/10⁶ | Average runtime/s | Model size/MB |
| --- | --- | --- | --- |
| GGRU | 29 | 7.74 | 5.62 |
| LLA | 86 | 13.89 | 9.83 |
| LFC | 71 | 9.92 | 6.36 |
| FLFT | 63 | 7.35 | 5.88 |
| DLFT | 55 | 5.24 | 4.84 |
| Proposed model | 17 | 2.74 | 3.76 |
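For reference, figures like "Parameters/10⁶" and "Model size/MB" in Table 4 are typically computed from the trained network itself; the utility below is a generic PyTorch sketch of that computation, not the authors' measurement script, and the size estimate simply sums the in-memory bytes of the parameter tensors.

```python
# Generic model-statistics helper (assumed methodology, not from the paper).
import torch

def model_stats(model: torch.nn.Module) -> tuple[float, float]:
    """Return (parameter count / 1e6, parameter memory in MB)."""
    n_params = sum(p.numel() for p in model.parameters())
    size_mb = sum(p.numel() * p.element_size()
                  for p in model.parameters()) / 2**20
    return n_params / 1e6, size_mb
```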
References

[1] KE X, CAO B, BAI J, et al. Speech emotion recognition based on PCA and CHMM [C]// Proceedings of the 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference. Piscataway: IEEE, 2019: 667-671. DOI: 10.1109/itaic.2019.8785867.
[2] SHAH M, MIAO L, CHAKRABARTI C, et al. A speech emotion recognition framework based on latent Dirichlet allocation [C]// Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2013: 2553-2557. DOI: 10.1109/icassp.2013.6638116.
[3] DUTTA K, SARMA K K. Multiple feature extraction for RNN-based Assamese speech recognition for speech to text conversion application [C]// Proceedings of the 2012 International Conference on Communications, Devices and Intelligent Systems. Piscataway: IEEE, 2012: 600-603. DOI: 10.1109/codis.2012.6422274.
[4] GUO H, JIANG N, REN J. Research on speech emotion recognition based on mixed features of MFCC and GFCC [J]. Electro-Optic Technology Application, 2019, 34(6): 34-39. DOI: 10.3969/j.issn.1673-1255.2019.06.008.
[5] CHEN M, ZHAO X. A multi-scale fusion framework for bimodal speech emotion recognition [C]// Proceedings of the 2020 Cognitive Intelligence for Speech Processing. Baixas, FR: International Speech Communication Association, 2020: 374-378. DOI: 10.21437/interspeech.2020-3156.
[6] TZIRAKIS P, TRIGEORGIS G, NICOLAOU M A, et al. End-to-end multimodal emotion recognition using deep neural networks [J]. IEEE Journal of Selected Topics in Signal Processing, 2017, 11(8): 1301-1309. DOI: 10.1109/jstsp.2017.2764438.
[7] SUN L, LIU B, TAO J, et al. Multimodal cross- and self-attention network for speech emotion recognition [C]// Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2021: 4275-4279. DOI: 10.1109/icassp39728.2021.9414654.
[8] YOON S, BYUN S, DEY S, et al. Speech emotion recognition using multi-hop attention mechanism [C]// Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2019: 2822-2826. DOI: 10.1109/icassp.2019.8683483.
[9] CHOI W Y, SONG K Y, LEE C W. Convolutional attention networks for multimodal emotion recognition from speech and text data [C]// Proceedings of the 2018 Grand Challenge and Workshop on Human Multimodal Language. Stroudsburg, PA: Association for Computational Linguistics, 2018: 28-34. DOI: 10.18653/v1/w18-3304.
[10] CHEN P Z, ZHANG X X, XU F P. Multimodal emotion recognition based on speech signals and text information [J]. Journal of East China Jiaotong University, 2017, 34(2): 100-104.
[11] ZHONG Y, HU Y, HUANG H, et al. A lightweight model based on separable convolution for speech emotion recognition [C]// Proceedings of the 2020 Cognitive Intelligence for Speech Processing. Baixas, FR: International Speech Communication Association, 2020: 3331-3335. DOI: 10.21437/interspeech.2020-2408.
[12] GU Y, JIN Y, MA Y, et al. Multimodal emotion recognition based on acoustic and lexical features [J]. Journal of Data Acquisition & Processing, 2022, 37(6): 1353-1362.
[13] GAO W J, ZHAO H Y, LI L, et al. Text sentiment analysis based on the ALBERT-HACNN-TUP model [J]. Computer Simulation, 2023, 40(5): 491-496. DOI: 10.3969/j.issn.1006-9348.2023.05.089.
[14] WANG Y Y. Aspect-level sentiment analysis based on Albert and syntactic tree [J]. Intelligent Computer and Applications, 2023, 13(4): 52-59. DOI: 10.3969/j.issn.2095-2163.2023.04.010.
[15] RUAN G H, ZHONG Y R, JIANG J M. Design of speech interaction system based on MFCC coefficient [J]. Automation & Instrumentation, 2022(6): 167-171.
[16] MENG Q X, YU J, CHANG J, et al. Human behavior recognition method by Wi-Fi channel state information based on MFCC characteristics [J]. Computer Applications and Software, 2022, 39(12): 125-131. DOI: 10.3969/j.issn.1000-386x.2022.12.019.
[17] WU Y, LIN Z, ZHAO Y, et al. A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis [C]// Proceedings of the 2021 Findings of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2021: 4730-4738. DOI: 10.18653/v1/2021.findings-acl.417.
[18] LI J R, LYU G Y, LI R, et al. Chinese negative semantic representation and annotation combined with hybrid attention mechanism and BiLSTM-CRF [J]. Computer Engineering and Applications, 2023, 59(9): 167-175. DOI: 10.3778/j.issn.1002-8331.2201-0088.
[19] YANG M, LI Y, HUANG Z, et al. Partially view-aligned representation learning with noise robust contrastive loss [C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 1134-1143. DOI: 10.1109/cvpr46437.2021.00119.
[20] CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations [C]// Proceedings of the 37th International Conference on Machine Learning. New York: JMLR.org, 2020: 1597-1607.
[21] SCHULLER B W, BATLINER A, BERGLER C, et al. The INTERSPEECH 2020 computational paralinguistics challenge: elderly emotion, breathing and masks [C]// Proceedings of the 2020 Cognitive Intelligence for Speech Processing. Baixas, FR: International Speech Communication Association, 2020: 2042-2046. DOI: 10.21437/interspeech.2020-32.
[22] LI W X, GAN C Q. Multimodal emotional analysis of hierarchical interactive fusion based on attention mechanism [J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2023, 35(1): 176-184.
[23] LAI X M, TANG H, CHEN H Y, et al. Multimodal sentiment analysis based on feature fusion of attention mechanism-bidirectional gated recurrent unit [J]. Journal of Computer Applications, 2021, 41(5): 1268-1274. DOI: 10.11772/j.issn.1001-9081.2020071092.
[24] LONG Y C, DING M R, LIN G J, et al. Emotion recognition based on visual and audiovisual perception system [J]. Computer Systems & Applications, 2021, 30(12): 218-225.
[25] YOON S, BYUN S, JUNG K. Multimodal speech emotion recognition using audio and text [C]// Proceedings of the 2018 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2018: 112-118. DOI: 10.1109/slt.2018.8639583.
[26] TRIPATHI S, TRIPATHI S, BEIGI H. Multi-modal emotion recognition on IEMOCAP dataset using deep learning [EB/OL]. (2018-04-16) [2023-01-05].
[27] ATMAJA B T, SHIRAI K, AKAGI M. Speech emotion recognition using speech feature and word embedding [C]// Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2019: 519-523. DOI: 10.1109/apsipaasc47483.2019.9023098.
[28] ZHANG X, WANG M J, GUO X D. Multi-modal emotion recognition based on deep learning in speech, video and text [C]// Proceedings of the 2020 IEEE 5th International Conference on Signal and Image Processing. Piscataway: IEEE, 2020: 328-333. DOI: 10.1109/icsip49896.2020.9339464.