[1] MORENCY L P, MIHALCEA R, DOSHI P. Towards multimodal sentiment analysis: harvesting opinions from the web[C]// Proceedings of the 13th International Conference on Multimodal Interfaces. New York: ACM, 2011: 169-176.
[2] CHOY K L, FAN K K H, LO V. Development of an intelligent customer-supplier relationship management system: the application of case-based reasoning[J]. Industrial Management and Data Systems, 2003, 103(4): 263-274.
[3] MA L, LU Z, SHANG L, et al. Multimodal convolutional neural networks for matching image and sentence[C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 2623-2631.
[4] MAO J, XU W, YANG Y, et al. Explain images with multimodal recurrent neural networks[EB/OL]. (2014-10-04) [2023-03-12].
[5] LI G, DUAN N, FANG Y, et al. Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2020: 11336-11344.
[6] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[EB/OL]. (2021-06-03) [2023-03-03].
[7] TOUVRON H, CORD M, DOUZE M, et al. Training data-efficient image transformers & distillation through attention[C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 10347-10357.
[8] YU J, JIANG J. Adapting BERT for target-oriented multimodal sentiment classification[C]// Proceedings of the 28th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2019: 5408-5414.
[9] HOU R, CHANG H, MA B, et al. Cross attention network for few-shot classification[C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 4003-4014.
[10] LU J, BATRA D, PARIKH D, et al. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019, 32: 13-23.
[11] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 8748-8763.
[12] KIM W, SON B, KIM I. ViLT: vision-and-language transformer without convolution or region supervision[C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 5583-5594.
[13] DAO T, FU D Y, ERMON S, et al. FlashAttention: fast and memory-efficient exact attention with IO-awareness[C/OL]// Proceedings of the 36th International Conference on Neural Information Processing Systems. (2022) [2023-11-12].
[14] HE P, LIU X, GAO J, et al. DeBERTa: decoding-enhanced BERT with disentangled attention[EB/OL]. (2021-10-06) [2023-11-12].
[15] OQUAB M, DARCET T, MOUTAKANNI T, et al. DINOv2: learning robust visual features without supervision[EB/OL]. (2023-04-14) [2023-11-12].
[16] NIU T, ZHU S, PANG L, et al. Sentiment analysis on multi-view social data[C]// Proceedings of the 2016 International Conference on MultiMedia Modeling, LNCS 9517. Cham: Springer, 2016: 15-27.
[17] 李文潇, 梅红岩, 李雨恬. 基于深度学习的多模态情感分析研究综述[J]. 辽宁工业大学学报(自然科学版), 2022, 42(5): 293-298. (LI W X, MEI H Y, LI Y T. Survey of multimodal sentiment analysis based on deep learning[J]. Journal of Liaoning Institute of Technology (Natural Science Edition), 2022, 42(5): 293-298.)
[18] 郭续, 买日旦·吾守尔, 古兰拜尔·吐尔洪. 基于多模态融合的情感分析算法研究综述[J]. 计算机工程与应用, 2024, 60(2): 1-18. (GUO X, WUSHOUER M, TUERHONG G. Survey of sentiment analysis algorithms based on multimodal fusion[J]. Computer Engineering and Applications, 2024, 60(2): 1-18.)
[19] LI J, LI D, XIONG C, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[C]// Proceedings of the 39th International Conference on Machine Learning. New York: JMLR.org, 2022: 12888-12900.
[20] SINGH A, HU R, GOSWAMI V, et al. FLAVA: a foundational language and vision alignment model[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 15617-15629.