Sequential multimodal sentiment analysis model based on multi-task learning

doi:10.11772/j.issn.1001-9081.2020091416

Abstract

To address the problems of unimodal feature representation and cross-modal feature fusion in sequential multimodal sentiment analysis, a multi-task learning based sentiment analysis model incorporating a multi-head attention mechanism was proposed. Firstly, a Convolutional Neural Network (CNN), a Bidirectional Gated Recurrent Unit (BiGRU) and Multi-Head Self-Attention (MHSA) were used to represent the features of each sequential modality. Secondly, multi-head attention was used to fuse information bidirectionally across modalities. Finally, based on multi-task learning, sentiment polarity classification and sentiment intensity regression were added as auxiliary tasks to improve the overall performance of the main task, sentiment score regression. Experimental results show that, compared with the Multimodal Factorization Model (MFM), the proposed model improves binary classification accuracy by 7.8 and 3.1 percentage points on the CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) and CMU Multimodal Opinion-level Sentiment Intensity (CMU-MOSI) datasets, respectively. The proposed model is therefore applicable to sentiment analysis in multimodal scenarios, and can provide decision support for product recommendation, stock market forecasting, public opinion monitoring and other related applications.

Key words: sentiment analysis, multimodal, multi-task learning, sequential learning, feature fusion
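The abstract names the fusion and multi-task components but not their equations. As a rough illustration only, the pure-Python sketch below shows (a) multi-head scaled dot-product attention applied across two modalities in both directions and (b) a weighted loss that combines the sentiment score regression main task with the polarity classification and intensity regression auxiliary tasks. Everything beyond those task definitions is an assumption, not the authors' implementation: identity Q/K/V projections, mean pooling over time, L1 and binary cross-entropy losses, a zero threshold for polarity, intensity defined as the absolute score, and the hypothetical weights `w_polarity` and `w_intensity`. The unimodal CNN/BiGRU/MHSA encoder is omitted.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are time steps, columns are feature dimensions


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query_seq: Matrix, context_seq: Matrix, num_heads: int) -> Matrix:
    """Multi-head scaled dot-product attention: queries come from one modality,
    keys and values from another. Simplification (assumption): the Q/K/V
    projections are identity maps, so head h attends over its own slice of
    the feature dimension and no output projection is applied."""
    dim = len(query_seq[0])
    if dim % num_heads != 0 or len(context_seq[0]) != dim:
        raise ValueError("feature sizes must match and be divisible by num_heads")
    head_dim = dim // num_heads
    out = [[0.0] * dim for _ in query_seq]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        for i, q in enumerate(query_seq):
            # Scaled scores of query step i against every context step (this head only).
            scores = [sum(q[d] * k[d] for d in range(lo, hi)) / math.sqrt(head_dim)
                      for k in context_seq]
            weights = softmax(scores)
            # Weighted sum of the context values in this head's slice.
            for d in range(lo, hi):
                out[i][d] = sum(w * v[d] for w, v in zip(weights, context_seq))
    return out


def mean_pool(seq: Matrix) -> List[float]:
    """Average over time steps (pooling choice is an assumption)."""
    return [sum(col) / len(seq) for col in zip(*seq)]


def bidirectional_fusion(a: Matrix, b: Matrix, num_heads: int) -> List[float]:
    """A queries B and B queries A; pool both results and concatenate them."""
    a_to_b = cross_modal_attention(a, b, num_heads)
    b_to_a = cross_modal_attention(b, a, num_heads)
    return mean_pool(a_to_b) + mean_pool(b_to_a)


def multitask_loss(pred_score: float, pred_pos_prob: float, pred_intensity: float,
                   true_score: float, w_polarity: float = 0.5,
                   w_intensity: float = 0.5) -> float:
    """Main task: sentiment score regression (L1 loss).
    Auxiliary 1: polarity classification (binary cross-entropy; the label is
    positive when true_score >= 0, an assumed threshold).
    Auxiliary 2: intensity regression (L1 loss against |true_score|, assumed).
    The weights w_polarity and w_intensity are hypothetical."""
    main = abs(pred_score - true_score)
    label = 1.0 if true_score >= 0 else 0.0
    p = min(max(pred_pos_prob, 1e-12), 1 - 1e-12)  # clamp so log() stays finite
    polarity = -(label * math.log(p) + (1 - label) * math.log(1 - p))
    intensity = abs(pred_intensity - abs(true_score))
    return main + w_polarity * polarity + w_intensity * intensity


# Usage: 2 text steps and 3 audio steps, both with 4 features.
text = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5]]
audio = [[0.2, 0.1, 0.0, 1.0], [0.3, -0.1, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
fused = bidirectional_fusion(text, audio, num_heads=2)
print(len(fused))  # 8: pooled text-queries-audio (4) + pooled audio-queries-text (4)
print(multitask_loss(pred_score=1.5, pred_pos_prob=0.9, pred_intensity=1.4, true_score=2.0))
```

Attending in both directions means each modality contributes its own view of the other, which is one plausible reading of "bidirectional cross-modal fusion"; the auxiliary losses share the same fused representation, so their gradients act as extra supervision for the main regression head.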

CLC Number: TP391.1

ZHANG Sun, YIN Chunyong. Sequential multimodal sentiment analysis model based on multi-task learning[J]. Journal of Computer Applications, 2021, 41(6): 1631-1639.