Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (4): 1025-1034. DOI: 10.11772/j.issn.1001-9081.2024030319
• Artificial intelligence •

Review on bimodal emotion recognition based on speech and text
Lingmin HAN, Xianhong CHEN, Wenmeng XIONG
Received: 2024-03-21
Revised: 2024-05-20
Accepted: 2024-05-27
Online: 2024-06-18
Published: 2025-04-10
Contact: Xianhong CHEN
About author: HAN Lingmin, born in 1999 in Tangshan, Hebei, M.S. candidate. Her research interests include sentiment analysis and emotion recognition.
Lingmin HAN, Xianhong CHEN, Wenmeng XIONG. Review on bimodal emotion recognition based on speech and text[J]. Journal of Computer Applications, 2025, 45(4): 1025-1034.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024030319
Database | Size | Modalities | Type | Emotion annotations |
---|---|---|---|---|
RAVDESS | 7,356 files | visual, speech | acted | calm, happy, sad, angry, fearful, surprised, disgusted |
eNTERFACE’05 | 1,166 video sequences | visual, speech | induced | anger, fear, surprise, happiness, sadness, disgust |
CHEAVD | 2,322 utterances | visual, speech | acted | neutral, angry, happy, sad, worried, anxious, surprised, disgusted |
MELD | 1,433 dialogues | visual, speech, text | acted | anger, disgust, fear, joy, neutral, sadness, surprise; plus positive, negative, neutral |
IEMOCAP | 10,039 utterances | visual, speech, text | acted | neutral, happy, sad, angry, surprised, fearful, disgusted, frustrated, excited; plus valence, arousal, dominance |
CMU-MOSEI | 23,453 utterances | visual, speech, text | spontaneous | happiness, sadness, anger, fear, disgust, surprise; plus positive, negative, neutral |
SEMAINE | 150 utterances | visual, speech | induced | valence, arousal, dominance, power, intensity |
Belfast | 298 video clips | visual, speech | spontaneous | anger, disgust, fear, happiness, neutral, sadness, surprise |
VAM | 12 h of data | visual, speech | spontaneous | valence, arousal, dominance |
RECOLA | 3.8 h of data | visual, speech, ECG, electrodermal activity | spontaneous | valence, arousal, dominance, agreement, engagement, performance, rapport |
Tab. 1 Common multi-modal emotion recognition databases
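The annotation column of Table 1 mixes two schemes: discrete categories (anger, happiness, ...) and continuous dimensions (valence, arousal, dominance). Below is a minimal sketch of a record type that can carry either kind of label; the class, field names, and the sample file name are illustrative assumptions, not part of any corpus's official tooling.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EmotionSample:
    """One utterance from a multimodal emotion corpus (illustrative schema)."""
    audio_path: str                    # path to the speech recording
    transcript: Optional[str]          # None for corpora without a text modality
    category: Optional[str] = None     # discrete label, e.g. "angry"
    valence: Optional[float] = None    # dimensional labels (VAM, IEMOCAP, ...)
    arousal: Optional[float] = None
    dominance: Optional[float] = None

# A hypothetical IEMOCAP-style utterance carrying both annotation schemes:
sample = EmotionSample("Ses01F_impro01_F000.wav", "I'm fine.",
                       category="neutral", valence=3.0, arousal=2.5)
```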
Reference | Database | Extracted features (method/feature) | Fusion method | Results |
---|---|---|---|---|
Ref. [ ] | IEMOCAP | Librosa-extracted features + BERT word embeddings | feature-level fusion | WA=70.18% |
Ref. [ ] | IEMOCAP | MFCC features + GloVe word embeddings | model-level fusion | UA=48.7%; WA=57.9% |
Ref. [ ] | IEMOCAP | openSMILE-extracted features + BERT | model-level fusion | UA=65.10% |
Ref. [ ] | IEMOCAP | openSMILE-extracted features + RoBERTa | multi-level fusion | W-F1=68.46%; Acc=68.67% |
Ref. [ ] | IEMOCAP | Bidirectional Gated Recurrent Unit (Bi-GRU) | multi-level fusion | UA=70.71%; WA=72.75% |
Ref. [ ] | IEMOCAP | wav2vec 2.0, Mel spectrogram, etc. + transcribed text | decision-level fusion | UA=73.0%; WA=73.5% |
Ref. [ ] | IEMOCAP | BN (BottleNeck) features + Word2vec, ANEW (Affective Norms for English Words) | feature-level fusion | UA=75.5%; WA=73.7% |
Ref. [ ] | IEMOCAP | HuBERT features + MPNet features | model-level fusion | UA=76.4%; WA=75.2% |
Ref. [ ] | MOSI | MFSC (log Mel-Frequency Spectral Coefficients) features + Word2vec | multi-level fusion | UA=76.5%; WA=76.4% |
Ref. [ ] | CMU-MOSI | LSTM + BERT | multi-level fusion | F1-score=77.7% |
Ref. [ ] | IEMOCAP | FBank features + BERT word embeddings | multi-level fusion | UA=79.7%; WA=77.9% |
Ref. [ ] | IEMOCAP | LLDs (Low-Level Descriptors) + Word2vec | feature-level fusion | UA=79.51%; WA=78.70% |
Ref. [ ] | IEMOCAP | wav2vec 2.0, Mel spectrogram + RoBERTa-base | feature-level fusion | UA=80.5%; WA=81.2% |
Ref. [ ] | JTES | LLD features + BERT word embeddings | decision-level fusion | Average recognition rate=82.5% |
Tab. 2 Result statistics of different bimodal emotion recognition methods
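Most results in Table 2 are reported as WA and UA. By the convention commonly used on IEMOCAP, WA (weighted accuracy) is plain overall accuracy, while UA (unweighted accuracy) is the mean of per-class recalls and is therefore insensitive to class imbalance. A self-contained sketch under that convention (the function name is ours; individual papers may define the metrics slightly differently):

```python
import numpy as np

def wa_ua(y_true, y_pred):
    """WA = overall accuracy; UA = unweighted mean of per-class recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float((y_true == y_pred).mean())
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    ua = float(np.mean(recalls))
    return wa, ua

# 4 samples, one class-0 error: WA = 0.75, UA = (2/3 + 1)/2 ≈ 0.83
print(wa_ua([0, 0, 0, 1], [0, 0, 1, 1]))
```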
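To make the fusion-method column concrete, here is a minimal sketch of the two simplest strategies in the table: feature-level fusion (concatenate per-modality vectors before a single classifier) and decision-level fusion (combine per-modality class posteriors). The checkpoint names and the MFCC recipe are common defaults, assumed for illustration, not the exact setups of the cited papers.

```python
import numpy as np
import librosa
import torch
from transformers import AutoTokenizer, AutoModel

# Text encoder: bert-base-uncased is a common default, not necessarily
# the checkpoint used by any specific method in Table 2.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def audio_features(path: str) -> np.ndarray:
    """Mean-pooled MFCCs extracted with librosa (one common recipe)."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)   # shape (40, T)
    return mfcc.mean(axis=1)                             # shape (40,)

def text_features(text: str) -> np.ndarray:
    """[CLS] embedding from BERT as a fixed-size sentence vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[0, 0].numpy()           # shape (768,)

def feature_level_fusion(path: str, text: str) -> np.ndarray:
    """Concatenate modality vectors; a single classifier sits on top."""
    return np.concatenate([audio_features(path), text_features(text)])

def decision_level_fusion(p_audio: np.ndarray, p_text: np.ndarray,
                          w: float = 0.5) -> int:
    """Weighted average of per-modality class posteriors, then argmax."""
    return int(np.argmax(w * p_audio + (1 - w) * p_text))
```

Model-level and multi-level fusion, by contrast, exchange information inside the network itself (e.g. via cross-modal attention between the two encoders), so they do not reduce to a single concatenation or score-averaging step like the two sketched above.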
1 | BEGAJ S, TOPAL A O, ALI M. Emotion recognition based on facial expressions using Convolutional Neural Network (CNN) [C]// Proceedings of the 2020 International Conference on Computing, Networking, Telecommunications and Engineering Sciences Applications. Piscataway: IEEE, 2020: 58-63. |
2 | YANG L, ZHAO H D, YU K K. End-to-end speech emotion recognition based on multi-head attention [J]. Journal of Computer Applications, 2022, 42(6): 1869-1875. |
3 | LI L B, CHEN T, REN F J, et al. Bimodal emotion recognition method based on graph neural network and attention [J]. Journal of Computer Applications, 2023, 43(3): 700-705. |
4 | PAN Y. Survey on emotion recognition [J]. Computer Knowledge and Technology, 2018, 14(8): 169-171. |
5 | SRIKANTH S S, MANIKANTAN S, POSONIA A M, et al. Voice assistant with emotion recognition [C]// Proceedings of the 2023 International Conference on Circuit Power and Computing Technologies. Piscataway: IEEE, 2023: 799-804. |
6 | TAN L, YU K, LIN L, et al. Speech emotion recognition enhanced traffic efficiency solution for autonomous vehicles in a 5G-enabled space-air-ground integrated intelligent transportation system [J]. IEEE Transactions on Intelligent Transportation Systems, 2022, 23(3): 2830-2842. |
7 | KYI A S S, LIN K Z. Detecting voice features for criminal case[C]// Proceedings of the 2019 International Conference on Advanced Information Technologies. Piscataway: IEEE, 2019: 212-216. |
8 | SUZUKI K, IGUCHI T, NAKAGAWA Y, et al. A multi-modal interaction robot based on emotion estimation method using physiological signals applied for elderly* [C]// Proceedings of the 32nd IEEE International Conference on Robot and Human Interactive Communication. Piscataway: IEEE, 2023: 2051-2057. |
9 | TANKO D, DOGAN S, DEMIR F B, et al. Shoelace pattern-based speech emotion recognition of the lecturers in distance education: ShoePat23 [J]. Applied Acoustics, 2022, 190: No.108637. |
10 | TAO J H, CHEN J J, LI Y W. Review on speech emotion recognition [J]. Journal of Signal Processing, 2023, 39(4): 571-587. |
11 | GAO L J, XUE L. Overview of speech emotion recognition [J]. Industrial Control Computer, 2022, 35(10): 115-116, 120. |
12 | KONANGI U M Y, KATREDDY V R, RASULA S K, et al. Emotion recognition through speech: a review [C]// Proceedings of the 2022 International Conference on Applied Artificial Intelligence and Computing. Piscataway: IEEE, 2022: 1150-1153. |
13 | BHOSALE S A, CHOUGULE S R. A review on face emotion recognition using EEG features and facial features [C]// Proceedings of the 1st International Conference on Cognitive Computing and Engineering Education. Piscataway: IEEE, 2023: 1-5. |
14 | MEI Y, TAN G Z, LIU Z T. Review on facial expression recognition based on video image [J]. Journal of Hunan University of Arts and Science (Science and Technology), 2016, 28(3): 19-25. |
15 | HE J, LIU Y, HE Z W. Research progress of multimodal emotion recognition [J]. Application Research of Computers, 2018, 35(11): 3201-3205. |
16 | WANG C Y, LI W X, CHEN Z H. Research of multi-modal emotion recognition based on voice and video images [J]. Computer Engineering and Applications, 2021, 57(23): 163-170. |
17 | EKMAN P. Facial expressions of emotion: new findings, new questions [J]. Psychological Science, 1992, 3(1): 34-38. |
18 | RUSSELL J A. A circumplex model of affect [J]. Journal of Personality and Social Psychology, 1980, 39(6): 1161-1178. |
19 | RUSSELL J A, MEHRABIAN A. Evidence for a three-factor theory of emotions [J]. Journal of Research in Personality, 1977, 11(3): 273-294. |
20 | LIVINGSTONE S R, RUSSO F A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English [J]. PLoS ONE, 2018, 13(5): No.e0196391. |
21 | MARTIN O, KOTSIA I, MACQ B, et al. The eNTERFACE’ 05 audio-visual emotion database [C]// Proceedings of the 22nd International Conference on Data Engineering Workshops. Piscataway: IEEE, 2006: 8-8. |
22 | LI Y, TAO J, CHAO L, et al. CHEAVD: a Chinese natural emotional audio-visual database [J]. Journal of Ambient Intelligence and Humanized Computing, 2017, 8(6): 913-924. |
23 | PORIA S, HAZARIKA D, MAJUMDER N, et al. MELD: a multimodal multi-party dataset for emotion recognition in conversations [C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 527-536. |
24 | DOUGLAS-COWIE E, COWIE R, SCHRÖDER M. A new emotion database: considerations, sources and scope [C]// Proceedings of the ITRW on Speech and Emotion 2000. [S.l.]: International Speech Communication Association, 2000: 39-44. |
25 | BUSSO C, BULUT M, LEE C C, et al. IEMOCAP: interactive emotional dyadic motion capture database [J]. Language Resources and Evaluation, 2008, 42(4): 335-359. |
26 | BAGHER ZADEH A, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph [C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2018: 2236-2246. |
27 | McKEOWN G, VALSTAR M F, COWIE R, et al. The SEMAINE corpus of emotionally coloured character interactions [C]// Proceedings of the 2010 IEEE International Conference on Multimedia and Expo. Piscataway: IEEE, 2010: 1079-1084. |
28 | GRIMM M, KROSCHEL K, NARAYANAN S. The Vera am Mittag German audio-visual emotional speech database [C]// Proceedings of the 2008 IEEE International Conference on Multimedia and Expo. Piscataway: IEEE, 2008: 865-868. |
29 | RINGEVAL F, SONDEREGGER A, SAUER J, et al. Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions [C]// Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. Piscataway: IEEE, 2013: 1-8. |
30 | LeCUN Y, BOSER B, DENKER J S, et al. Backpropagation applied to handwritten zip code recognition [J]. Neural Computation, 1989, 1(4): 541-551. |
31 | ELMAN J L. Finding structure in time [J]. Cognitive Science, 1990, 14(2): 179-211. |
32 | HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780. |
33 | SCHNEIDER S, BAEVSKI A, COLLOBERT R, et al. wav2vec: Unsupervised pre-training for speech recognition [C]// Proceedings of the INTERSPEECH 2019. [S.l.]: International Speech Communication Association, 2019: 3465-3469. |
34 | BAEVSKI A, ZHOU H, MOHAMED A, et al. wav2vec 2.0: A framework for self-supervised learning of speech representations[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 12449-12460. |
35 | HSU W N, BOLTE B, TSAI Y H H, et al. HuBERT: self-supervised speech representation learning by masked prediction of hidden units [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3451-3460. |
36 | CHEN S, WANG C, CHEN Z, et al. WavLM: large-scale self-supervised pre-training for full stack speech processing [J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6): 1505-1518. |
37 | LIANG H, SUN X, SUN Y, et al. Text feature extraction based on deep learning: a review [J]. EURASIP Journal on Wireless Communications and Networking, 2017, 2017: No.211. |
38 | LIU Y, AI H, ZHANG W D. A survey of multimodal emotion recognition based on deep learning [J]. Journal of Xi’an University of Posts and Telecommunications, 2022, 27(1): 60-71. |
39 | PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation [C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014: 1532-1543. |
40 | PETERS M E, NEUMANN M, IYYER M, et al. Deep contextualized word representations [C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Stroudsburg: ACL, 2018: 2227-2237. |
41 | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional Transformers for language understanding [C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg: ACL, 2019: 4171-4186. |
42 | HE J, ZHANG C Q, LI X Z, et al. Survey of research on multimodal fusion technology for deep learning [J]. Computer Engineering, 2020, 46(5): 1-11. |
43 | HOU M, ZHANG Z, LIU C, et al. Semantic alignment network for multi-modal emotion recognition [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(9): 5318-5329. |
44 | BOKADE G U, KANPHADE R D. Secure multimodal biometric authentication using Face, Palmprint and Ear: a feature level fusion approach [C]// Proceedings of the 10th International Conference on Computing, Communication and Networking Technologies. Piscataway: IEEE, 2019: 1-5. |
45 | CAI L, LIAO Z, ZHOU S, et al. Visual question answering combining multi-modal feature fusion and multi-attention mechanism [C]// Proceedings of the IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering. Piscataway: IEEE, 2021: 1035-1039. |
46 | DU X, YANG J, XIE X. Multimodal emotion recognition based on feature fusion and residual connection [C]// Proceedings of the IEEE 2nd International Conference on Electrical Engineering, Big Data and Algorithms. Piscataway: IEEE, 2023: 373-377. |
47 | KIM E, SHIN J W. DNN-based emotion recognition based on bottleneck acoustic features and lexical features [C]// Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2019: 6720-6724. |
48 | LI F, LUO J, LIU W. Speech emotion recognition using multi-modal feature fusion network [C]// Proceedings of the IEEE 6th International Conference on Pattern Recognition and Artificial Intelligence. Piscataway: IEEE, 2023: 884-888. |
49 | DEVI D V R, RAO K N. Decision level fusion schemes for a Multimodal Biometric System using local and global wavelet features [C]// Proceedings of the 2020 IEEE International Conference on Electronics, Computing and Communication Technologies. Piscataway: IEEE, 2020: 1-6. |
50 | CHEN Y, JIA Z, HIROTA K, et al. A multimodal emotion perception model based on context-aware decision-level fusion[C]// Proceedings of the 41st Chinese Control Conference. Piscataway: IEEE, 2022: 7332-7337. |
51 | SATO K, KISHI K, KOSAKA T. Speech emotion recognition by late fusion of linguistic and acoustic features using deep learning models [C]// Proceedings of the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2023: 1013-1018. |
52 | ZHAO Y, CAO X, LIN J, et al. Multimodal affective states recognition based on multiscale CNNs and biologically inspired decision fusion model [J]. IEEE Transactions on Affective Computing, 2023, 14(2): 1391-1403. |
53 | HAN H, XIANG M, LIAN C, et al. A multimodal deep neural network for ECG and PCG classification with multimodal fusion[C]// Proceedings of the 13th International Conference on Information Science and Technology. Piscataway: IEEE, 2023: 124-128. |
54 | HUANG J, TAO J, LIU B, et al. Multimodal Transformer fusion for continuous emotion recognition [C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2020: 3507-3511. |
55 | WANG S, MA Y, DING Y. Exploring complementary features in multi-modal speech emotion recognition [C]// Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2023: 1-5. |
56 | ZHU X, WU J, ZHU L. RGB-D saliency detection based on cross-modal and multi-scale feature fusion [C]// Proceedings of the 34th Chinese Control and Decision Conference. Piscataway: IEEE, 2022: 6154-6160. |
57 | ZHANG J, WANG Q, HAN Y. Multi-modal fusion with multi-level attention for visual dialog [J]. Information Processing and Management, 2020, 57: No.102152. |
58 | WANG Y. Efficient audio-visual speaker recognition via deep multi-modal feature fusion [C]// Proceedings of the 17th International Conference on Computational Intelligence and Security. Piscataway: IEEE, 2021: 99-103. |
59 | GU Y, YANG K, FU S, et al. Multimodal affective analysis using hierarchical attention strategy with word-level alignment [C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2018: 2225-2235. |
60 | LI M, YANG Y H, KE X Z. Emotion recognition model based on hybrid feature extraction and cross-modal feature prediction fusion [J]. Journal of Computer Applications, 2024, 44(1): 86-93. |
61 | HUNT E B, MARIN J, STONE P J. Experiments in induction [M]. New York: Academic Press, 1966. |
62 | BREIMAN L. Random forests [J]. Machine Learning, 2001, 45(1): 5-32. |
63 | CORTES C, VAPNIK V. Support-vector networks [J]. Machine Learning, 1995, 20(3): 273-297. |
64 | COVER T, HART P. Nearest neighbor pattern classification [J]. IEEE Transactions on Information Theory, 1967, 13(1): 21-27. |
65 | McCULLOCH W S, PITTS W. A logical calculus of the ideas immanent in nervous activity [J]. Bulletin of Mathematical Biophysics, 1943, 5: 115-133. |
66 | FENG Y Q, SHEN L J, HU T T, et al. Using speech and text features fusion to improve speech emotion recognition [J]. Journal of Data Acquisition and Processing, 2019, 34(4): 625-631. |
67 | GU Y, JIN Y, MA Y, et al. Multimodal emotion recognition based on acoustic and lexical features [J]. Journal of Data Acquisition and Processing, 2022, 37(6): 1353-1362. |
68 | ZHAO Z, GAO T, WANG H, et al. SWRR: feature map classifier based on sliding window attention and high-response feature reuse for multimodal emotion recognition [C]// Proceedings of the INTERSPEECH 2023. [S.l.]: International Speech Communication Association, 2023: 2433-2437. |
69 | SHI A W, CAI R. Multimodal emotion recognition method combining multiple attention mechanisms [J]. Software Guide, 2023, 22(10): 105-109. |
70 | PATAMIA R A, JIN W, ACHEAMPONG K N, et al. Transformer based multimodal speech emotion recognition with improved neural networks [C]// Proceedings of the IEEE 2nd International Conference on Pattern Recognition and Machine Learning. Piscataway: IEEE, 2021: 195-203. |
71 | LEE Y, YOON S, JUNG K. Multimodal speech emotion recognition using cross attention with aligned audio and text [C]// Proceedings of the INTERSPEECH 2020. [S.l.]: International Speech Communication Association, 2020: 2717-2721. |
72 | PEPINO L, RIERA P, FERRER L, et al. Fusion approaches for emotion recognition from speech using acoustic and text-based features [C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2020: 6484-6488. |
73 | LI J, WANG X, LIU Y, et al. CFN-ESA: a cross-modal fusion network with emotion-shift awareness for dialogue emotion recognition [J]. IEEE Transactions on Affective Computing, 2024, 15(4): 1919-1933. |
74 | MAKIUCHI M R, UTO K, SHINODA K. Multimodal emotion recognition with high-level speech and text features [C]// Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop. Piscataway: IEEE, 2021: 350-357. |
75 | SUN H, CHEN Y W, LIN L. TensorFormer: a tensor-based multimodal transformer for multimodal sentiment analysis and depression detection [J]. IEEE Transactions on Affective Computing, 2023, 14(4): 2776-2786. |
76 | AUDEBERT N, LE SAUX B, LEFÈVRE S. Beyond RGB: very high resolution urban remote sensing with multimodal deep networks [J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2018, 140: 20-32. |
77 | WÖLLMER M, WENINGER F, KNAUP T, et al. YouTube movie reviews: sentiment analysis in an audio-visual context [J]. IEEE Intelligent Systems, 2013, 28(3): 46-53. |
78 | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010. |
79 | MOU L, ZHAO Y, ZHOU C, et al. Driver emotion recognition with a hybrid attentional multimodal fusion framework [J]. IEEE Transactions on Affective Computing, 2023, 14(4): 2970-2981. |
80 | GAO Y, FU Y, SUN M, et al. Multi-modal hierarchical empathetic framework for social robots with affective body control [J]. IEEE Transactions on Affective Computing, 2024, 15(3): 1621-1633. |
81 | SANTOSA J, ISHIZUKA K, HASHIMOTO T. Large language model-based emotional speech annotation using context and acoustic feature for speech emotion recognition [C]// Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2024: 11026-11030. |
82 | BOSCH E, LE HOUCQ CORBÍ R, IHME K, et al. Frustration recognition using spatio temporal data: a novel dataset and GCN model to recognize in-vehicle frustration [J]. IEEE Transactions on Affective Computing, 2023, 14(4): 2864-2875. |