Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 86-93.DOI: 10.11772/j.issn.1001-9081.2023060753
• Cross-media representation learning and cognitive reasoning •
Mu LI, Yuheng YANG, Xizheng KE

Received: 2023-06-15
Revised: 2023-08-14
Accepted: 2023-08-21
Online: 2023-09-25
Published: 2024-01-10
Contact: Yuheng YANG
About author: LI Mu, born in 1972 in Xi'an, Shaanxi, M.S., senior engineer. His research interests include vital sign detection and deep learning.
Mu LI, Yuheng YANG, Xizheng KE. Emotion recognition model based on hybrid-Mel gamma frequency cross-attention Transformer model [J]. Journal of Computer Applications, 2024, 44(1): 86-93.
LI Mu, YANG Yuheng, KE Xizheng. Emotion recognition model based on hybrid feature extraction and cross-modal feature prediction fusion [J]. Journal of Computer Applications, 2024, 44(1): 86-93.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2023060753
| Dataset | Training set | Validation set | Test set | Total |
|---|---|---|---|---|
| CMU-MOSI | 1 453 | 232 | 411 | 2 096 |
| CMU-MOSEI | 16 853 | 2 103 | 2 597 | 21 553 |
| IEMOCAP | 6 711 | 634 | 1 746 | 9 091 |

Tab. 1 Dataset sample sizes
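The split sizes in Tab. 1 can be sanity-checked directly. A minimal sketch follows; the dict literal simply mirrors the table above, and nothing here comes from the authors' code:

```python
# Sanity-check of the Tab. 1 splits: each row's train/val/test must sum to its total.
SPLITS = {
    "CMU-MOSI":  (1453, 232, 411, 2096),
    "CMU-MOSEI": (16853, 2103, 2597, 21553),
    "IEMOCAP":   (6711, 634, 1746, 9091),
}

for name, (train, val, test, total) in SPLITS.items():
    assert train + val + test == total, name
    print(f"{name}: {train/total:.1%} train / {val/total:.1%} val / {test/total:.1%} test")
```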
| Model | IEMOCAP Acc/% | IEMOCAP F1/% | IEMOCAP MAE | IEMOCAP Corr | CMU-MOSI Acc/% | CMU-MOSI F1/% | CMU-MOSI MAE | CMU-MOSI Corr | CMU-MOSEI Acc/% | CMU-MOSEI F1/% | CMU-MOSEI MAE | CMU-MOSEI Corr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GGRU | 71.80 | 62.83 | 0.894 | 0.695 | 73.42 | 61.31 | 0.797 | 0.696 | 72.93 | 64.35 | 0.762 | 0.701 |
| LLA | 69.74 | 57.97 | 0.923 | 0.698 | 68.73 | 59.49 | 0.832 | 0.701 | 71.31 | 61.99 | 0.816 | 0.706 |
| LFC | 75.49 | 63.62 | 0.793 | 0.745 | 71.29 | 65.34 | 0.743 | 0.743 | 73.84 | 67.68 | 0.663 | 0.747 |
| FLFT | 74.27 | 67.45 | 0.787 | 0.748 | 77.83 | 74.33 | 0.693 | 0.752 | 79.33 | 69.73 | 0.593 | 0.753 |
| DLFT | 77.18 | 71.26 | 0.768 | 0.755 | 79.32 | 73.89 | 0.687 | 0.757 | 78.37 | 74.33 | 0.588 | 0.759 |
| Proposed model | 80.01 | 69.73 | 0.759 | 0.763 | 81.96 | 74.33 | 0.676 | 0.768 | 81.42 | 72.46 | 0.594 | 0.765 |

Tab. 2 Comparison of emotion fusion performance of different models
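Tab. 2 reports four metrics per dataset: classification accuracy (Acc) and F1 on discrete labels, plus mean absolute error (MAE) and Pearson correlation (Corr) on continuous sentiment scores. As a point of reference, here is a minimal sketch of how such metrics are conventionally computed, assuming scikit-learn and SciPy; variable names are illustrative, and the paper's exact evaluation protocol is not specified in this extract:

```python
# Conventional computation of the four Tab. 2 metrics (illustrative, not the authors' code).
from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error
from scipy.stats import pearsonr

def fusion_metrics(y_true, y_pred, score_true, score_pred):
    """Acc/% and F1/% on discrete labels; MAE and Corr on continuous scores."""
    acc = 100 * accuracy_score(y_true, y_pred)
    f1 = 100 * f1_score(y_true, y_pred, average="weighted")
    mae = mean_absolute_error(score_true, score_pred)
    corr, _ = pearsonr(score_true, score_pred)
    return acc, f1, mae, corr
```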
| No. | Model | CMU-MOSI Acc/% | CMU-MOSI F1/% | CMU-MOSI MAE | CMU-MOSI Corr | CMU-MOSEI Acc/% | CMU-MOSEI F1/% | CMU-MOSEI MAE | CMU-MOSEI Corr |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Proposed model | 81.96 | 74.33 | 0.676 | 0.768 | 81.42 | 72.46 | 0.594 | 0.765 |
| 2 | H-MGFCT/H | 72.39 | 67.16 | 0.796 | 0.652 | 73.42 | 69.67 | 0.827 | 0.674 |
| 3 | H-MGFCT/C | 74.64 | 71.37 | 0.821 | 0.691 | 75.79 | 73.54 | 0.787 | 0.621 |
| 4 | H-MGFCT/S | 75.41 | 72.03 | 0.784 | 0.667 | 76.76 | 71.98 | 0.769 | 0.659 |
| 5 | H-MGFCT/T | 77.41 | 72.56 | 0.794 | 0.711 | 77.28 | 72.73 | 0.754 | 0.761 |

Tab. 3 Effectiveness validation of H-MGFCC features/CSA-Transformer/contrastive learning/encoder prediction based on attention weight index
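One of the ablated components in Tab. 3 is the CSA-Transformer, which combines cross- and self-attention across modalities. The paper's implementation is not reproduced in this extract; for orientation only, below is a generic cross-modal attention block of the kind such models build on. The `CrossModalAttention` name, layer sizes, and sequence lengths are all illustrative assumptions:

```python
# Generic cross-modal attention block (illustrative sketch, not the paper's CSA-Transformer).
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, audio):
        # Text queries attend over audio keys/values; residual connection + layer norm.
        fused, _ = self.attn(query=text, key=audio, value=audio)
        return self.norm(text + fused)

# Usage: fuse a text sequence with an audio feature sequence of equal embedding dim.
t = torch.randn(8, 50, 256)        # (batch, text_len, dim)
a = torch.randn(8, 120, 256)       # (batch, audio_len, dim)
out = CrossModalAttention()(t, a)  # (8, 50, 256)
```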
| Model | Parameters/10^6 | Average running time/s | Model size/MB |
|---|---|---|---|
| GGRU | 29 | 7.74 | 5.62 |
| LLA | 86 | 13.89 | 9.83 |
| LFC | 71 | 9.92 | 6.36 |
| FLFT | 63 | 7.35 | 5.88 |
| DLFT | 55 | 5.24 | 4.84 |
| Proposed model | 17 | 2.74 | 3.76 |

Tab. 4 Overall performance comparison of different models
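The three Tab. 4 columns can be measured generically for any neural model. A hedged PyTorch sketch follows; the `profile_model` helper, checkpoint path, and run count are illustrative assumptions, not the authors' tooling:

```python
# Illustrative measurement of the Tab. 4 columns for a torch.nn.Module.
import os
import time
import torch

def profile_model(model, sample_input, ckpt_path="model.pt", runs=10):
    # Parameters/10^6: total trainable and non-trainable parameter count.
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    # Model size/MB: serialized state-dict size on disk.
    torch.save(model.state_dict(), ckpt_path)
    size_mb = os.path.getsize(ckpt_path) / (1024 * 1024)
    # Average running time/s: mean wall-clock time of a forward pass.
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(runs):
            model(sample_input)
        run_s = (time.perf_counter() - start) / runs
    return params_m, run_s, size_mb
```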
References

[1] KE X, CAO B, BAI J, et al. Speech emotion recognition based on PCA and CHMM [C]// Proceedings of the 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference. Piscataway: IEEE, 2019: 667-671. 10.1109/itaic.2019.8785867
[2] SHAH M, MIAO L, CHAKRABARTI C, et al. A speech emotion recognition framework based on latent Dirichlet allocation [C]// Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2013: 2553-2557. 10.1109/icassp.2013.6638116
[3] DUTTA K, SARMA K K. Multiple feature extraction for RNN-based Assamese speech recognition for speech to text conversion application [C]// Proceedings of the 2012 International Conference on Communications, Devices and Intelligent Systems. Piscataway: IEEE, 2012: 600-603. 10.1109/codis.2012.6422274
[4] GUO H, JIANG N, REN J. Research on speech emotion recognition based on mixed features of MFCC and GFCC [J]. Electro-Optic Technology Application, 2019, 34(6): 34-39. 10.3969/j.issn.1673-1255.2019.06.008
[5] CHEN M, ZHAO X. A multi-scale fusion framework for bimodal speech emotion recognition [C]// Proceedings of INTERSPEECH 2020. Baixas: International Speech Communication Association, 2020: 374-378. 10.21437/interspeech.2020-3156
[6] TZIRAKIS P, TRIGEORGIS G, NICOLAOU M A, et al. End-to-end multimodal emotion recognition using deep neural networks [J]. IEEE Journal of Selected Topics in Signal Processing, 2017, 11(8): 1301-1309. 10.1109/jstsp.2017.2764438
[7] SUN L, LIU B, TAO J, et al. Multimodal cross- and self-attention network for speech emotion recognition [C]// Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2021: 4275-4279. 10.1109/icassp39728.2021.9414654
[8] YOON S, BYUN S, DEY S, et al. Speech emotion recognition using multi-hop attention mechanism [C]// Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2019: 2822-2826. 10.1109/icassp.2019.8683483
[9] CHOI W Y, SONG K Y, LEE C W. Convolutional attention networks for multimodal emotion recognition from speech and text data [C]// Proceedings of the 2018 Grand Challenge and Workshop on Human Multimodal Language. Stroudsburg, PA: Association for Computational Linguistics, 2018: 28-34. 10.18653/v1/w18-3304
[10] CHEN P Z, ZHANG X X, XU F P. Multimodal emotion recognition based on speech signals and text information [J]. Journal of East China Jiaotong University, 2017, 34(2): 100-104.
[11] ZHONG Y, HU Y, HUANG H, et al. A lightweight model based on separable convolution for speech emotion recognition [C]// Proceedings of INTERSPEECH 2020. Baixas: International Speech Communication Association, 2020: 3331-3335. 10.21437/interspeech.2020-2408
[12] GU Y, JIN Y, MA Y, et al. Multimodal emotion recognition based on acoustic and lexical features [J]. Journal of Data Acquisition & Processing, 2022, 37(6): 1353-1362.
[13] GAO W J, ZHAO H Y, LI L, et al. Text sentiment analysis based on the ALBERT-HACNN-TUP model [J]. Computer Simulation, 2023, 40(5): 491-496. 10.3969/j.issn.1006-9348.2023.05.089
[14] WANG Y Y. Aspect-level sentiment analysis based on Albert and syntactic tree [J]. Intelligent Computer and Applications, 2023, 13(4): 52-59. 10.3969/j.issn.2095-2163.2023.04.010
[15] RUAN G H, ZHONG Y R, JIANG J M. Design of speech interaction system based on MFCC coefficient [J]. Automation & Instrumentation, 2022(6): 167-171.
[16] MENG Q X, YU J, CHANG J, et al. Human behavior recognition method by Wi-Fi channel state information based on MFCC characteristics [J]. Computer Applications and Software, 2022, 39(12): 125-131. 10.3969/j.issn.1000-386x.2022.12.019
[17] WU Y, LIN Z, ZHAO Y, et al. A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis [C]// Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Stroudsburg, PA: Association for Computational Linguistics, 2021: 4730-4738. 10.18653/v1/2021.findings-acl.417
[18] LI J R, LYU G Y, LI R, et al. Chinese negative semantic representation and annotation combined with hybrid attention mechanism and BiLSTM-CRF [J]. Computer Engineering and Applications, 2023, 59(9): 167-175. 10.3778/j.issn.1002-8331.2201-0088
[19] YANG M, LI Y, HUANG Z, et al. Partially view-aligned representation learning with noise-robust contrastive loss [C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 1134-1143. 10.1109/cvpr46437.2021.00119
[20] CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations [C]// Proceedings of the 37th International Conference on Machine Learning. New York: JMLR.org, 2020: 1597-1607.
[21] SCHULLER B W, BATLINER A, BERGLER C, et al. The INTERSPEECH 2020 computational paralinguistics challenge: elderly emotion, breathing and masks [C]// Proceedings of INTERSPEECH 2020. Baixas: International Speech Communication Association, 2020: 2042-2046. 10.21437/interspeech.2020-32
[22] LI W X, GAN C Q. Multimodal emotional analysis of hierarchical interactive fusion based on attention mechanism [J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2023, 35(1): 176-184.
[23] LAI X M, TANG H, CHEN H Y, et al. Multimodal sentiment analysis based on feature fusion of attention mechanism-bidirectional gated recurrent unit [J]. Journal of Computer Applications, 2021, 41(5): 1268-1274. 10.11772/j.issn.1001-9081.2020071092
[24] LONG Y C, DING M R, LIN G J, et al. Emotion recognition based on visual and audiovisual perception system [J]. Computer Systems & Applications, 2021, 30(12): 218-225.
[25] YOON S, BYUN S, JUNG K. Multimodal speech emotion recognition using audio and text [C]// Proceedings of the 2018 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2018: 112-118. 10.1109/slt.2018.8639583
[26] TRIPATHI S, TRIPATHI S, BEIGI H. Multi-modal emotion recognition on IEMOCAP dataset using deep learning [EB/OL]. (2018-04-16) [2023-01-05].
[27] ATMAJA B T, SHIRAI K, AKAGI M. Speech emotion recognition using speech feature and word embedding [C]// Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2019: 519-523. 10.1109/apsipaasc47483.2019.9023098
[28] ZHANG X, WANG M-J, GUO X-D. Multi-modal emotion recognition based on deep learning in speech, video and text [C]// Proceedings of the 2020 IEEE 5th International Conference on Signal and Image Processing. Piscataway: IEEE, 2020: 328-333. 10.1109/icsip49896.2020.9339464