Multimodal bio-coupling correlation driven audio-visual deepfake detection

doi:10.11772/j.issn.1001-9081.2026010057

Journal of Computer Applications

Received: 2026-01-22 Revised: 2026-04-09 Online: 2026-05-13 Published: 2026-05-13

Bai Xiaoyu (白小宇)¹, Yang Gaoming (杨高明)², Wu Lanlan (吴兰兰)¹, Xiao Xinian (肖禧年)¹, Li Xuelian (李雪莲)³

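1. Anhui University of Science and Technology
2. School of Computer Science and Engineering
3. School of Computer Science and Engineering, Anhui University of Science and Technology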

Corresponding author: Bai Xiaoyu (白小宇)

Abstract

Existing multimodal deepfake detection methods fail to effectively capture and exploit the fine-grained biological coupling relationships between modalities. To address this, a Multimodal Bio-Coupling Correlation Driven Audio-Visual Deepfake Detection model (MBCD-AV) is proposed. Multimodal bio-coupling correlation refers to the stable physiological and semantic synergy between speech signals and facial movements (especially lip movements) in genuine audio-visual recordings. First, a frame-level multi-scale spatiotemporal encoder is introduced to jointly model the dynamic variations of the facial and lip regions. Next, a cross-modal lip movement feature extraction module is built on CLIP (Contrastive Language-Image Pre-training) to model the physiological consistency between speech signals and lip movements. Finally, a dual-branch biological collaborative fusion module is designed to achieve hierarchical alignment and fusion of facial, lip, and audio representations. Experimental results show that the method achieves an average AUC above 93% on the five subsets of the unimodal forgery dataset FaceForensics++. On the audio-visual forgery dataset FakeAVCeleb, the AUC reaches 98.94%, and on DeepFakeTIMIT an AUC of 100% is obtained on the subsets of different quality levels. On average, the method outperforms the compared methods, which verifies its effectiveness for multimodal deepfake detection.
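To make the notions of audio-lip consistency and AUC concrete, the following is a minimal Python sketch and not the MBCD-AV implementation. It assumes that per-frame audio and lip embeddings are already available as plain vectors, uses mean frame-wise cosine similarity as a stand-in for the learned consistency measure, and computes AUC from its pairwise-ranking definition. The function names `coupling_score` and `auc` are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def coupling_score(audio_frames: Sequence[Sequence[float]],
                   lip_frames: Sequence[Sequence[float]]) -> float:
    """Mean frame-wise cosine similarity between audio and lip embeddings.

    Assumption: both sequences are time-aligned and live in a shared
    embedding space. This is an illustrative proxy, not the paper's module.
    """
    if not audio_frames or len(audio_frames) != len(lip_frames):
        raise ValueError("need two non-empty sequences of equal length")
    sims = [cosine(a, l) for a, l in zip(audio_frames, lip_frames)]
    return sum(sims) / len(sims)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive outranks a negative.

    Labels: 1 = real clip (expected high coupling), 0 = fake clip.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Under these assumptions, a clip whose lip embeddings track its audio embeddings scores close to 1, and an AUC of 1.0 means every real clip scores above every fake clip. The pairwise loop is quadratic in the number of clips, so a rank-based implementation would be preferable for large test sets.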

CLC Number: TP391.41, TP18

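Bai Xiaoyu, Yang Gaoming, Wu Lanlan, Xiao Xinian, Li Xuelian. Multimodal bio-coupling correlation driven audio-visual deepfake detection[J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2026010057.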