Journal of Computer Applications

Multimodal Fact Verification with Cross-Modal Semantic Association

  • Received: 2025-05-14  Revised: 2025-07-14  Accepted: 2025-08-08  Online: 2025-08-22  Published: 2025-08-22
  • Supported by:
    the Fundamental Research Funds for the Central Universities

LIU Huanxian1, WANG Hongtao2, WANG Xian'ao2, WANG Hongmei1, XU Weifeng2

  1. North China Electric Power University
  2. North China Electric Power University (Baoding)
  • Corresponding author: XU Weifeng

Abstract: Multimodal fact verification aims to classify the veracity of claims by leveraging multimodal evidence, overcoming the limitations of traditional unimodal methods on complex claims. However, existing approaches struggle during feature fusion to bridge the semantic gaps between the different modalities of evidence, and between claims and evidence. To address this, a multimodal fact verification method based on cross-modal semantic association is proposed. It achieves cross-level semantic alignment and adaptive feature interaction through a cross-modal attention mechanism, effectively mitigating semantic discrepancies among multi-source information and improving classification performance on complex claims. In the evidence retrieval phase, textual evidence is retrieved for the claim and then used to filter semantically related image evidence, ensuring that the multimodal evidence is highly relevant. In the verification phase, the Contrastive Language-Image Pre-training (CLIP) model semantically aligns the claim text with the multimodal evidence, and a claim-evidence joint attention module strengthens the semantic associations among the claim text, textual evidence, and image evidence. Experimental results on the MOCHEG and CEAD datasets show that the proposed method significantly outperforms existing approaches in accuracy, recall, and F1 score, demonstrating its effectiveness for multimodal fact verification.
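
As an illustration of the verification stage described above, the following is a minimal PyTorch sketch of an evidence-filtering step and a claim-evidence joint attention block operating on CLIP features. All names, the embedding dimension (512), the cosine-similarity threshold, and the three-way label space (supported / refuted / not enough information, as in MOCHEG) are assumptions for illustration, not the paper's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

def filter_image_evidence(txt_evid_emb, image_embs, threshold=0.25):
    """Keep candidate images semantically related to the textual evidence.

    txt_evid_emb: (d,)  pooled CLIP text embedding of the retrieved evidence
    image_embs:   (N, d) CLIP image embeddings of candidate images
    The 0.25 cosine-similarity threshold is an assumed hyperparameter.
    """
    sims = F.cosine_similarity(image_embs, txt_evid_emb.unsqueeze(0), dim=-1)
    return image_embs[sims >= threshold]

class ClaimEvidenceJointAttention(nn.Module):
    """Claim attends to textual and image evidence in CLIP's shared space."""

    def __init__(self, dim=512, heads=8, num_classes=3):
        super().__init__()
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Veracity classifier over the claim plus its two
        # evidence-conditioned views.
        self.classifier = nn.Linear(3 * dim, num_classes)

    def forward(self, claim, txt_evid, img_evid):
        # claim:    (B, 1, d)  CLIP embedding of the claim text
        # txt_evid: (B, Nt, d) CLIP embeddings of textual evidence
        # img_evid: (B, Ni, d) CLIP embeddings of image evidence
        c_txt, _ = self.txt_attn(claim, txt_evid, txt_evid)
        c_img, _ = self.img_attn(claim, img_evid, img_evid)
        fused = torch.cat([self.norm(claim), self.norm(c_txt),
                           self.norm(c_img)], dim=-1)
        return self.classifier(fused.squeeze(1))

# Toy usage with random tensors standing in for real CLIP features.
model = ClaimEvidenceJointAttention()
logits = model(torch.randn(2, 1, 512),   # claim
               torch.randn(2, 5, 512),   # textual evidence
               torch.randn(2, 3, 512))   # image evidence
print(logits.shape)  # torch.Size([2, 3])

Attention masking for variable evidence counts and any alignment losses used during training are omitted; the sketch only shows how the claim can adaptively attend to each evidence modality in CLIP's shared embedding space.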

Key words: multimodal fact verification, semantic association, attention mechanism, Contrastive Language–Image Pretraining (CLIP), feature fusion

