Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (4): 1069-1076. DOI: 10.11772/j.issn.1001-9081.2025050526

• Artificial Intelligence •


Multimodal fact verification with cross-modal semantic association

Huanxian LIU1, Hongtao WANG1,2, Xian’ao WANG1, Hongmei WANG1, Weifeng XU1,3()   

  1. Department of Computer, North China Electric Power University, Baoding, Hebei 071003, China
    2. Engineering Research Center of Intelligent Computing for Complex Energy Systems, Ministry of Education (North China Electric Power University), Baoding, Hebei 071003, China
    3. Hebei Key Laboratory of Knowledge Computing for Energy and Power (North China Electric Power University), Baoding, Hebei 071003, China
  • Received:2025-05-14 Revised:2025-07-14 Accepted:2025-08-08 Online:2025-08-22 Published:2026-04-10
  • Contact: Weifeng XU
  • About author:LIU Huanxian, born in 2001, M. S. candidate. Her research interests include natural language processing.
    WANG Hongtao, born in 1983, Ph. D., associate professor. His research interests include natural language processing, AI security, privacy computing, knowledge computing.
    WANG Xian’ao, born in 2001, M. S. candidate. His research interests include AI security.
    WANG Hongmei, born in 1981, M. S., lecturer. Her research interests include statistical machine learning, probabilistic graph theory, numerical computation.
  • Supported by:
    Fundamental Research Funds for the Central Universities(2023JC006)


Abstract:

To address the semantic differences between different modalities of evidence, and between claims and evidence, that arise during multimodal feature fusion, a Multimodal Fact Verification (MFV) method based on Cross-Modal Semantic Association (CMSA) was proposed to realize cross-level semantic alignment and adaptive feature interaction, bridging the semantic gaps among multi-source information and improving classification performance on complex claim verification. In the evidence retrieval stage, relevant textual evidence was retrieved with the textual claim, and semantically related image evidence was then filtered with the retrieved textual evidence, so as to ensure high relevance of the multimodal evidence. In the claim verification stage, semantic alignment between text and multimodal evidence was achieved with the CLIP (Contrastive Language-Image Pretraining) model, and a Linked Claim and Evidence Attention (LCEA) module was designed to further strengthen the semantic associations among the textual claim, the textual evidence, and the image evidence. Experimental results show that CMSA improves the F1 score over the MOCHEG model by at least 7.27% and 6.65% on the public dataset and the self-constructed CEAD (Cross-modal Evidence Augmented Dataset), respectively, demonstrating its effectiveness in MFV tasks.
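The two-stage retrieval described in the abstract (claim → textual evidence → image evidence) can be sketched as similarity-based ranking over embedding vectors. This is only an illustrative sketch: the function names, the cosine-similarity ranking, and the top-k cutoffs are assumptions for exposition, not the paper's actual retrieval implementation.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between each row of a and each row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def retrieve_evidence(claim_vec, text_vecs, image_vecs, k_text=2, k_image=1):
    """Two-stage retrieval: the claim selects top-k textual evidence,
    then the selected textual evidence filters the image evidence."""
    # Stage 1: rank textual evidence by similarity to the claim.
    text_scores = cosine_sim(claim_vec[None, :], text_vecs)[0]
    top_text = np.argsort(text_scores)[::-1][:k_text]
    # Stage 2: rank images by their best similarity to any selected text evidence.
    img_scores = cosine_sim(text_vecs[top_text], image_vecs).max(axis=0)
    top_img = np.argsort(img_scores)[::-1][:k_image]
    return top_text, top_img
```

In this toy setting the embeddings would come from a pretrained encoder (e.g., CLIP's text and image towers); here plain vectors stand in for them.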

Key words: Multimodal Fact Verification (MFV), semantic association, attention mechanism, CLIP (Contrastive Language-Image Pretraining) model, feature fusion
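The abstract names an LCEA module but does not give its internals here. As a rough illustration of joint claim-evidence attention, a generic scaled dot-product cross-attention in which the claim attends over the pooled text and image evidence, and vice versa, might look like the following; the fusion scheme (mean-pooling plus concatenation) is an assumption, not the paper's design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, keys, values):
    """Scaled dot-product attention: query rows attend over keys/values."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)      # (n_query, n_key)
    return softmax(scores, axis=-1) @ values  # (n_query, d)

def joint_claim_evidence_attention(claim, text_ev, image_ev):
    """Claim features attend over concatenated text and image evidence
    features, and the evidence attends back over the claim."""
    evidence = np.concatenate([text_ev, image_ev], axis=0)
    claim_ctx = cross_attention(claim, evidence, evidence)  # evidence-aware claim
    ev_ctx = cross_attention(evidence, claim, claim)        # claim-aware evidence
    # Illustrative fusion: mean-pool each stream and concatenate.
    return np.concatenate([claim_ctx.mean(axis=0), ev_ctx.mean(axis=0)])
```

In practice such a module would use learned projection matrices and multiple heads; this sketch keeps only the attention pattern among the three inputs.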
