《计算机应用》唯一官方网站 ›› 2026, Vol. 46 ›› Issue (4): 1058-1068.DOI: 10.11772/j.issn.1001-9081.2025050528

• 人工智能 •

基于语义融合和对比增强的多模态推荐方法

赵海华1, 胡怡君1, 唐瑞2, 莫先1()   

  1. 宁夏大学 信息工程学院,银川 750021
    2. 四川大学 网络空间安全学院,成都 610065
  • 收稿日期:2025-05-15 修回日期:2025-07-18 接受日期:2025-08-06 发布日期:2025-08-12 出版日期:2026-04-10
  • 通讯作者: 莫先
  • 作者简介:赵海华(1999—),男,宁夏中卫人,硕士研究生,CCF会员,主要研究方向:图学习、推荐系统
    胡怡君(2000—),男,山东济宁人,硕士研究生,主要研究方向:图学习、推荐系统
    唐瑞(1990—),男,四川泸州人,副研究员,博士,CCF会员,主要研究方向:社交网络分析、网络表示学习
  • 基金资助:
    国家自然科学基金资助项目(62306157);宁夏自然科学基金资助项目(2024AAC05011)

Multimodal recommendation method based on semantic fusion and contrast enhancement

Haihua ZHAO1, Yijun HU1, Rui TANG2, Xian MO1()   

  1. School of Information Engineering, Ningxia University, Yinchuan Ningxia 750021, China
    2. School of Cyberspace Security, Sichuan University, Chengdu Sichuan 610065, China
  • Received:2025-05-15 Revised:2025-07-18 Accepted:2025-08-06 Online:2025-08-12 Published:2026-04-10
  • Contact: Xian MO
  • About author:ZHAO Haihua, born in 1999, M. S. candidate. His research interests include graph learning, recommender systems.
    HU Yijun, born in 2000, M. S. candidate. His research interests include graph learning, recommender systems.
    TANG Rui, born in 1990, Ph. D., associate research fellow. His research interests include social network analysis, network representation learning.
  • Supported by:
    National Natural Science Foundation of China(62306157);Natural Science Foundation of Ningxia(2024AAC05011)

摘要:

多模态推荐旨在通过融合多模态信息增强用户和项目的特征表示,提升推荐性能。然而,现有方法存在跨模态语义信息融合不足、多模态特征冗余及噪声干扰问题。针对这些问题,提出一种基于语义融合和对比增强的多模态推荐方法(SFCERec)。首先,设计跨模态语义一致性增强框架,通过多模态语义特征筛选机制构建全局关联图,动态聚合多模态共性特征并抑制噪声传播;同时,提出多粒度属性解耦模块,从模态特征中分离粗粒度共性特征与用户行为驱动的细粒度特征,缓解特征冗余。其次,提出多层次对比学习范式,联合跨模态一致性对齐、用户行为相似性建模、项目语义关联性约束及显式-潜在特征互信息最大化这4类任务,通过对比学习强化表征的判别性。最后,进一步结合图扰动增强策略,通过添加噪声与双重对比正则化,提升模型对稀疏数据与噪声干扰的鲁棒性。在Amazon-Baby、Amazon-Sports和Amazon-Clothing数据集上的实验结果表明,该方法在Recall@20和NDCG@20指标上均优于所有基线模型,尤其在稀疏场景下。消融实验结果也验证了该方法的有效性。
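The cross-modal consistency alignment task in the multi-level contrastive paradigm above is typically realized with an InfoNCE-style objective. The following is a minimal NumPy sketch, not the paper's actual implementation: the function name `info_nce`, the temperature value 0.2, and the use of in-batch negatives are illustrative assumptions.

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.2):
    """InfoNCE loss aligning two views of the same items (e.g., visual
    vs. textual embeddings). Row i of z_a and row i of z_b form the
    positive pair; all other rows in the batch serve as negatives."""
    # L2-normalize so that dot products become cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau                    # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal
```

Minimizing this loss pulls matching cross-modal pairs together while pushing non-matching items apart, which is the discriminability-enhancing effect the abstract attributes to contrastive learning.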

关键词: 推荐系统, 多模态, 对比学习, 语义融合, 特征解耦

Abstract:

Multimodal recommendation aims to enhance user and item feature representations by integrating multimodal information, so as to improve recommendation performance. However, the existing methods still face challenges including insufficient cross-modal semantic information fusion, redundant multimodal features, and noise interference. To address these issues, a multimodal Recommendation method based on Semantic Fusion and Contrast Enhancement (SFCERec) was proposed. Firstly, a cross-modal semantic consistency enhancement framework was designed, in which a global correlation graph was constructed through a multimodal semantic feature filtering mechanism, so as to aggregate common multimodal features dynamically while suppressing noise propagation. Concurrently, a multi-granularity attribute disentanglement module was introduced to disentangle coarse-grained common features and user behavior-driven fine-grained features from modal features, so as to mitigate feature redundancy. Secondly, a multi-level contrastive learning paradigm was proposed to combine four tasks jointly: cross-modal consistency alignment, user behavior similarity modeling, item semantic relevance constraint, and explicit-latent feature mutual information maximization, thereby enhancing representation discriminability through contrastive learning. Finally, a graph perturbation enhancement strategy was further incorporated, employing noise injection and dual contrastive regularization to improve the model's robustness to sparse data and noise interference. Experimental results on the Amazon-Baby, Amazon-Sports, and Amazon-Clothing datasets demonstrate that the proposed method outperforms all baseline models on both Recall@20 and NDCG@20, particularly in sparse scenarios. Ablation studies further validate the effectiveness of the proposed method.
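The Recall@20 and NDCG@20 metrics used in the experiments above are standard top-K ranking measures. The sketch below shows how they are conventionally computed for a single user; the function name `recall_ndcg_at_k` is illustrative, and a non-empty ground-truth set is assumed.

```python
import numpy as np

def recall_ndcg_at_k(ranked_items, ground_truth, k=20):
    """Recall@K and NDCG@K for one user.
    ranked_items: item ids sorted by predicted score, descending.
    ground_truth: non-empty set of held-out items the user interacted with."""
    top_k = ranked_items[:k]
    hits = [1.0 if item in ground_truth else 0.0 for item in top_k]
    recall = sum(hits) / len(ground_truth)
    # DCG discounts hits by log2 of their rank; IDCG is the best achievable
    # DCG, i.e., all relevant items ranked first
    dcg = sum(h / np.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1.0 / np.log2(i + 2)
               for i in range(min(len(ground_truth), k)))
    return recall, dcg / idcg
```

Per-user values are then averaged over all test users; NDCG additionally rewards placing relevant items nearer the top of the list, which is why the two metrics are commonly reported together.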

Key words: Recommender System (RS), multimodal, contrastive learning, semantic fusion, feature disentanglement
