Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 86-93. DOI: 10.11772/j.issn.1001-9081.2023060753

• Cross-media Representation Learning and Cognitive Reasoning •

Emotion recognition model based on hybrid feature extraction and cross-modal feature prediction fusion

Mu LI, Yuheng YANG, Xizheng KE

  1. School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
  • Received: 2023-06-15  Revised: 2023-08-14  Accepted: 2023-08-21  Online: 2023-09-25  Published: 2024-01-10
  • Corresponding author: Yuheng YANG
  • About the authors: LI Mu (born 1972), male, from Xi’an, Shaanxi, senior engineer, M.S.; research interests: vital sign detection, deep learning.
    KE Xizheng (born 1962), male, from Lintong, Shaanxi, professor, Ph.D.; research interests: wireless laser communication.
    YANG Yuheng (born 1998), male, from Xi’an, Shaanxi, M.S. candidate and corresponding author; research interests: emotion recognition, deep learning.
  • Supported by:
    Xi’an Science and Technology Plan Project (2020KJRC0083)

Emotion recognition model based on hybrid-Mel Gamma frequency cross-attention Transformer

Mu LI, Yuheng YANG, Xizheng KE

  1. School of Automation and Information Engineering, Xi’an University of Technology, Xi’an Shaanxi 710048, China
  • Received: 2023-06-15  Revised: 2023-08-14  Accepted: 2023-08-21  Online: 2023-09-25  Published: 2024-01-10
  • Contact: Yuheng YANG
  • About author: LI Mu, born in 1972, M.S., senior engineer. His research interests include vital sign detection and deep learning.
    KE Xizheng, born in 1962, Ph.D., professor. His research interests include wireless laser communication.
  • Supported by:
    Xi’an Science and Technology Plan Project (2020KJRC0083)

Abstract:

To effectively mine unimodal representation information in multimodal sentiment analysis and achieve full fusion of multimodal information, an emotion recognition model based on hybrid features and cross-modal prediction fusion (H-MGFCT) was proposed. Firstly, a hybrid feature extraction algorithm (H-MGFCC) was obtained by fusing the Mel Frequency Cepstral Coefficients (MFCC), the Gammatone Frequency Cepstral Coefficients (GFCC), and their first-order dynamic features, which addressed the loss of speech emotional features. Secondly, a cross-modal prediction model based on attention weights was used to select the text features more highly correlated with the speech features. Then, a cross-modal attention model incorporating contrastive learning performed cross-modal information fusion of the highly correlated text features and the speech-modality emotional features. Finally, the resulting text-speech cross-modal features were fused with the remaining low-correlation text features to supplement the information. Experimental results show that, on the public IEMOCAP (Interactive EMotional dyadic MOtion CAPture), CMU-MOSI (CMU-Multimodal Opinion Sentiment Intensity), and CMU-MOSEI (CMU-Multimodal Opinion Sentiment Emotion Intensity) datasets, the accuracy of the proposed model is 2.83, 2.64, and 3.05 percentage points higher, respectively, than that of the weighted decision-level fusion speech-text emotion recognition (DLFT) model, verifying the effectiveness of the model for emotion recognition.
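For illustration only, the Python sketch below shows one plausible way such a hybrid MFCC/GFCC speech feature could be assembled. The helper name hybrid_mgfcc, the 13-coefficient setting, the frame-wise concatenation used as the "fusion", and the externally supplied GFCC matrix (e.g. from a gammatone filterbank front end) are assumptions, not details taken from the paper.

```python
# Minimal sketch of a hybrid MFCC+GFCC feature in the spirit of H-MGFCC.
# Assumptions (not from the paper): 13 static coefficients, frame-wise
# concatenation as the "fusion", and a precomputed GFCC matrix supplied
# by the caller (e.g. a custom gammatone filterbank front end).
import numpy as np
import librosa

def hybrid_mgfcc(y: np.ndarray, sr: int, gfcc: np.ndarray, n_mfcc: int = 13) -> np.ndarray:
    """Concatenate MFCC, GFCC and their first-order deltas frame by frame.

    y    : mono waveform
    sr   : sample rate in Hz
    gfcc : precomputed Gammatone-frequency cepstral coefficients,
           shape (n_gfcc, n_frames), roughly aligned with the MFCC frames
    """
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)       # (n_mfcc, n_frames)
    n = min(mfcc.shape[1], gfcc.shape[1])                        # align frame counts
    mfcc, gfcc = mfcc[:, :n], gfcc[:, :n]
    d_mfcc = librosa.feature.delta(mfcc)                         # first-order dynamics
    d_gfcc = librosa.feature.delta(gfcc)
    return np.concatenate([mfcc, d_mfcc, gfcc, d_gfcc], axis=0)  # (2*(n_mfcc+n_gfcc), n_frames)
```

Frame-level vectors of this kind would then feed the speech branch of a downstream cross-modal model.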

Key words: feature extraction, multimodal fusion, emotion recognition, cross-modal fusion, attention mechanism

Abstract:

An emotion recognition model based on the Hybrid-Mel Gamma Frequency Cross-attention Transformer (H-MGFCT) was proposed to effectively mine unimodal representation information and fully fuse multimodal information in multimodal sentiment analysis. Firstly, the Hybrid-Mel Gamma Frequency Cepstral Coefficient (H-MGFCC) feature was obtained by fusing the Mel Frequency Cepstral Coefficient (MFCC) and the Gammatone Frequency Cepstral Coefficient (GFCC) as well as their first-order dynamic features, which alleviated the loss of speech emotional features; secondly, a cross-modal prediction model based on attention weights was used to select the text features most correlated with the speech features; subsequently, a Cross Self-Attention Transformer (CSA-Transformer) incorporating contrastive learning was used to fuse the highly correlated text features with the speech-modality emotional features across modalities; finally, the resulting text-speech cross-modal features were fused with the remaining low-correlation text features to supplement the information. The experimental results show that the proposed model improves the accuracy by 2.83, 2.64, and 3.05 percentage points, respectively, compared with the weighted Decision Level Fusion Text-audio (DLFT) model on the publicly available IEMOCAP (Interactive EMotional dyadic MOtion CAPture), CMU-MOSI (CMU-Multimodal Opinion Sentiment Intensity), and CMU-MOSEI (CMU-Multimodal Opinion Sentiment Emotion Intensity) datasets, verifying the effectiveness of the model for emotion recognition.
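As a rough, non-authoritative illustration of the cross-modal fusion step, the sketch below wires a single PyTorch multi-head attention layer so that speech frames query text tokens. It is a minimal stand-in rather than the authors' CSA-Transformer; the class name CrossModalAttentionFusion, the model dimension, head count, residual/LayerNorm layout, and mean pooling are assumed for the example.

```python
# Minimal cross-modal attention sketch (not the authors' CSA-Transformer):
# speech frames serve as queries and text tokens as keys/values, so the
# fused representation is speech-aligned but text-informed.
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, speech: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # speech: (batch, T_speech, d_model); text: (batch, T_text, d_model)
        fused, _ = self.cross_attn(query=speech, key=text, value=text)
        fused = self.norm(speech + fused)   # residual connection around attention
        return fused.mean(dim=1)            # mean-pool to an utterance-level vector

# usage (hypothetical tensors): z = CrossModalAttentionFusion()(speech_feats, text_feats)
```

A contrastive objective, as mentioned in the abstract, would be applied on top of such fused and unimodal embeddings; it is omitted here to keep the sketch minimal.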

Key words: feature extraction, multimodal fusion, emotion recognition, cross-modal fusion, attention mechanism

CLC number: