[1] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
[2] CHUNG J, GULCEHRE C, CHO K, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling [EB/OL]. [2024-07-11]. https://arxiv.org/abs/1412.3555.
[3] ZHAO Z, LIU Q. Former-DFER: dynamic facial expression recognition Transformer [C]// Proceedings of the 29th ACM International Conference on Multimedia. New York: ACM, 2021: 1553-1561.
[4] HARA K, KATAOKA H, SATOH Y. Learning spatio-temporal features with 3D residual networks for action recognition [C]// Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops. Piscataway: IEEE, 2017: 3154-3160.
[5] IQBAL M, SAMEEM M S I, NAQVI N, et al. A deep learning approach for face recognition based on angularly discriminative features [J]. Pattern Recognition Letters, 2019, 128: 414-419.
[6] JIANG X, ZONG Y, ZHENG W, et al. DFEW: a large-scale database for recognizing dynamic facial expressions in the wild [C]// Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 2881-2889.
[7] ZHANG S, ZHAO X, CHUANG Y, et al. Feature learning via deep belief network for Chinese speech emotion recognition [C]// Proceedings of the 2016 Chinese Conference on Pattern Recognition, CCIS 663. Singapore: Springer, 2016: 645-651.
[8] CHEN J, LI H F, MA L, et al. Multi-granularity feature fusion for dimensional speech emotion recognition [J]. Journal of Signal Processing, 2017, 33(3): 374-382.
[9] AREZZO A, BERRETTI S. Speaker VGG CCT: cross-corpus speech emotion recognition with speaker embedding and Vision Transformers [C]// Proceedings of the 4th ACM International Conference on Multimedia in Asia. New York: ACM, 2022: No. 7.
[10] LONG Y C, DING M R, LIN G J, et al. Emotion recognition based on visual and audiovisual perception system [J]. Computer Systems and Applications, 2021, 30(12): 218-225.
[11] ZADEH A, CHEN M, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis [C]// Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2017: 1103-1114.
[12] ZENG Z, TU J, PIANFETTI B, et al. Audio-visual affect recognition through multi-stream fused HMM for HCI [C]// Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2. Piscataway: IEEE, 2005: 967-972.
[13] LIU J J, WU X F. Real-time multimodal emotion recognition and emotion space labeling using LSTM networks [J]. Journal of Fudan University (Natural Science), 2020, 59(5): 565-574.
[14] WANG C Y, LI W X, CHEN Z H. Research of multi-modal emotion recognition based on voice and video images [J]. Computer Engineering and Applications, 2021, 57(23): 163-170.
[15] CHEN S, JIN Q. Multi-modal conditional attention fusion for dimensional emotion prediction [C]// Proceedings of the 24th ACM International Conference on Multimedia. New York: ACM, 2016: 571-575.
[16] HUANG J, TAO J, LIU B, et al. Multimodal Transformer fusion for continuous emotion recognition [C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2020: 3507-3511.
[17] ZADEH A B, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph [C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2018: 2236-2246.
[18] ALWASSEL H, MAHAJAN D, KORBAR B, et al. Self-supervised learning by cross-modal audio-video clustering [C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 9758-9770.
[19] ASANO Y M, PATRICK M, RUPPRECHT C, et al. Labelling unlabelled videos from scratch with multi-modal self-supervision [C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 4660-4671.
[20] SHI B, HSU W N, LAKHOTIA K, et al. Learning audio-visual speech representation by masked multimodal cluster prediction [EB/OL]. [2022-03-13]. https://arxiv.org/abs/2201.02184.
[21] AKBARI H, YUAN L, QIAN R, et al. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text [C]// Proceedings of the 35th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2021: 24206-24221.
[22] LIU J, ZHU X, LIU F, et al. OPT: omni-perception pre-trainer for cross-modal understanding and generation [EB/OL]. [2024-03-12]. https://arxiv.org/abs/2107.00249.
[23] PARTHASARATHY S, SUNDARAM S. Training strategies to handle missing modalities for audio-visual expression recognition [C]// Companion Publication of the 2020 International Conference on Multimodal Interaction. New York: ACM, 2020: 400-404.
[24] ZHAO J, LI R, JIN Q. Missing modality imagination network for emotion recognition with uncertain missing modalities [C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Stroudsburg: ACL, 2021: 2608-2618.
[25] DESPLANQUES B, THIENPONDT J, DEMUYNCK K. ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification [C]// Proceedings of the INTERSPEECH 2020. [S.l.]: International Speech Communication Association, 2020: 3830-3834.