[1] MORENCY L P, MIHALCEA R, DOSHI P. Towards multimodal sentiment analysis: harvesting opinions from the web[C]// Proceedings of the 13th International Conference on Multimodal Interfaces. New York: ACM, 2011: 169-176.
[2] CHOY K L, FAN K K H, LO V. Development of an intelligent customer-supplier relationship management system: the application of case-based reasoning[J]. Industrial Management and Data Systems, 2003, 103(4): 263-274.
[3] MA L, LU Z, SHANG L, et al. Multimodal convolutional neural networks for matching image and sentence[C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 2623-2631.
[4] MAO J, XU W, YANG Y, et al. Explain images with multimodal recurrent neural networks[EB/OL]. (2014-10-04) [2023-03-12].
[5] LI G, DUAN N, FANG Y, et al. Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2020: 11336-11344.
[6] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[EB/OL]. (2021-06-03) [2023-03-03].
[7] TOUVRON H, CORD M, DOUZE M, et al. Training data-efficient image transformers & distillation through attention[C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 10347-10357.
[8] YU J, JIANG J. Adapting BERT for target-oriented multimodal sentiment classification[C]// Proceedings of the 28th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2019: 5408-5414.
[9] HOU R, CHANG H, MA B, et al. Cross attention network for few-shot classification[C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 4003-4014.
[10] LU J, BATRA D, PARIKH D, et al. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019, 32: 13-23.
[11] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 8748-8763.
[12] KIM W, SON B, KIM I. ViLT: vision-and-language transformer without convolution or region supervision[C]// Proceedings of the 38th International Conference on Machine Learning. New York: JMLR.org, 2021: 5583-5594.
[13] DAO T, FU D Y, ERMON S, et al. FlashAttention: fast and memory-efficient exact attention with IO-awareness[C/OL]// Proceedings of the 36th International Conference on Neural Information Processing Systems. (2022) [2023-11-12].
[14] HE P, LIU X, GAO J, et al. DeBERTa: decoding-enhanced BERT with disentangled attention[EB/OL]. (2021-10-06) [2023-11-12].
[15] OQUAB M, DARCET T, MOUTAKANNI T, et al. DINOv2: learning robust visual features without supervision[EB/OL]. (2023-04-14) [2023-11-12].
[16] NIU T, ZHU S, PANG L, et al. Sentiment analysis on multi-view social data[C]// Proceedings of the 2016 International Conference on MultiMedia Modeling, LNCS 9517. Cham: Springer, 2016: 15-27.
[17] 李文潇, 梅红岩, 李雨恬. 基于深度学习的多模态情感分析研究综述[J]. 辽宁工业大学学报(自然科学版), 2022, 42(5): 293-298. (LI W X, MEI H Y, LI Y T. Survey of multimodal sentiment analysis based on deep learning[J]. Journal of Liaoning Institute of Technology (Natural Science Edition), 2022, 42(5): 293-298.)
[18] 郭续, 买日旦·吾守尔, 古兰拜尔·吐尔洪. 基于多模态融合的情感分析算法研究综述[J]. 计算机工程与应用, 2024, 60(2): 1-18. (GUO X, WUSHOUER M, TUERHONG G. Survey of sentiment analysis algorithms based on multimodal fusion[J]. Computer Engineering and Applications, 2024, 60(2): 1-18.)
[19] LI J, LI D, XIONG C, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[C]// Proceedings of the 39th International Conference on Machine Learning. New York: JMLR.org, 2022: 12888-12900.
[20] SINGH A, HU R, GOSWAMI V, et al. FLAVA: a foundational language and vision alignment model[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 15617-15629.