Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (10): 3134-3140. DOI: 10.11772/j.issn.1001-9081.2023101506

• Cyberspace security •

Construction and benchmark detection of a multimodal partial forgery dataset

Shengyou ZHENG, Yanxiang CHEN, Zuxing ZHAO, Haiyang LIU

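  1. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601
  • Received: 2023-11-06 Revised: 2023-12-30 Accepted: 2024-01-03 Online: 2024-10-15 Published: 2024-10-10
  • Corresponding author: Yanxiang CHEN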
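  • About the authors: ZHENG Shengyou, born in 1998 in Suzhou, Anhui, male, M. S. candidate. His research interests include multimedia content safety.
    CHEN Yanxiang, born in 1972 in Chaohu, Anhui, female, Ph. D., professor, CCF member. Her research interests include multimedia information processing and multimedia content safety. chenyx@hfut.edu.cn
    ZHAO Zuxing, born in 1998 in Shangrao, Jiangxi, male, M. S. candidate. His research interests include multimedia content safety.
    LIU Haiyang, born in 1999 in Bozhou, Anhui, male, M. S. candidate. His research interests include multimedia content safety and multi-instance learning.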
  • Supported by:
    National Natural Science Foundation of China (61972127)

Construction and benchmark detection of multimodal partial forgery dataset

Shengyou ZHENG, Yanxiang CHEN, Zuxing ZHAO, Haiyang LIU

  1. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei Anhui 230601, China
  • Received: 2023-11-06 Revised: 2023-12-30 Accepted: 2024-01-03 Online: 2024-10-15 Published: 2024-10-10
  • Contact: Yanxiang CHEN
  • About author:ZHENG Shengyou, born in 1998, M. S. candidate. His research interests include multimedia content safety.
    ZHAO Zuxing, born in 1998, M. S. candidate. His research interests include multimedia content safety.
    LIU Haiyang, born in 1999, M. S. candidate. His research interests include multimedia content safety and multi-instance learning.
  • Supported by:
    National Natural Science Foundation of China (61972127)

Abstract (translated from Chinese):

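To address the lack of multimodal forgery scenarios and partial forgery scenarios in existing video forgery datasets, a multimodal partial forgery dataset with an adjustable forgery ratio, PartialFAVCeleb, is constructed using a variety of audio and video forgery methods. The proposed dataset is based on the FakeAVCeleb multimodal forgery dataset and is built by splicing real and forged data, where the forged data are generated by four methods: FaceSwap, FSGAN (Face Swapping Generative Adversarial Network), Wav2Lip (Wave to Lip), and SV2TTS (Speaker Verification to Text-To-Speech). During splicing, a probabilistic method generates the location of each forged segment in the time domain and across modalities, the segment boundaries are randomized to match realistic forgery scenarios, and the source material is screened to avoid abrupt background changes. The resulting dataset yields 3 970 videos for each forgery ratio. In the benchmark detection, several audio and video feature extractors are used, and tests are run under both strongly supervised and weakly supervised conditions, with the weakly supervised tests based on the Hierarchical Multi-Instance Learning (HMIL) method. The test results show that every tested model performs significantly worse on data with a low forgery ratio than on data with a high forgery ratio, and significantly worse under weak supervision than under strong supervision, which confirms that weakly supervised detection on this partial forgery dataset is difficult. These results indicate that the multimodal partial forgery scenario represented by the proposed dataset has substantial research value.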
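The splicing step can be pictured with a minimal sketch. It is not the authors' procedure: the abstract does not specify the probability distributions, the number of forged segments, or how boundaries are randomized, so the single contiguous span, the `jitter` parameter, the modality probabilities, and all function names below are assumptions chosen only for illustration.

```python
import random


def sample_forged_span(n_frames, ratio, jitter=0.1, rng=None):
    """Pick one contiguous forged span covering roughly `ratio` of the clip.

    Assumption: the span length is perturbed by up to +/- `jitter` (relative)
    so that the boundaries do not fall at fixed positions.
    """
    rng = rng or random.Random()
    base = ratio * n_frames
    length = round(base * (1 + rng.uniform(-jitter, jitter)))
    length = max(1, min(n_frames, length))  # keep the span inside the clip
    start = rng.randint(0, n_frames - length)  # randint is inclusive on both ends
    return start, start + length


def sample_modality(rng, p_video=0.4, p_audio=0.4):
    """Choose which modality is forged; the probabilities are hypothetical."""
    u = rng.random()
    if u < p_video:
        return "video"
    if u < p_video + p_audio:
        return "audio"
    return "both"


def frame_labels(n_frames, ratio, seed=0):
    """Return per-frame 0/1 labels for the video and audio tracks."""
    rng = random.Random(seed)
    start, end = sample_forged_span(n_frames, ratio, rng=rng)
    modality = sample_modality(rng)
    video = [0] * n_frames
    audio = [0] * n_frames
    for t in range(start, end):
        if modality in ("video", "both"):
            video[t] = 1
        if modality in ("audio", "both"):
            audio[t] = 1
    return video, audio, modality
```

Calling `frame_labels(100, 0.25)` marks a span of about a quarter of the frames as forged in one or both tracks. A real pipeline would then replace the marked frames or audio samples with forged material from a matching source clip.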

Key words: deepfake detection, multimodal forgery detection, partial forgery, multi-instance learning, deepfake dataset, content safety

Abstract:

Aiming at the lack of multimodal forgery scenarios and partial forgery scenarios in existing video forgery datasets, a multimodal partial forgery dataset with adjustable forgery ratios, PartialFAVCeleb, was constructed by using a variety of audio and video forgery methods. The proposed dataset was based on the FakeAVCeleb multimodal forgery dataset and was built by splicing real and forged data, in which the forged data were generated by four methods: FaceSwap, FSGAN (Face Swapping Generative Adversarial Network), Wav2Lip (Wave to Lip), and SV2TTS (Speaker Verification to Text-To-Speech). In the splicing process, probabilistic methods were used to generate the locations of the forged segments in the time domain and across modalities, the segment boundaries were randomized to fit actual forgery scenarios, and abrupt background changes were avoided through material screening. The resulting dataset contains forged videos of different forgery ratios, with 3 970 videos for each ratio. In the benchmark detection, several audio and video feature extractors were used, and the data were tested under strongly supervised and weakly supervised conditions respectively, with the Hierarchical Multi-Instance Learning (HMIL) method used for the latter. The test results indicate that, for each tested model, the performance on data with a low forgery ratio is significantly inferior to that on data with a high forgery ratio, and the performance under the weakly supervised condition is significantly inferior to that under the strongly supervised condition, which verifies the difficulty of weakly supervised detection on the proposed partial forgery dataset. These results show that the multimodal partial forgery scenario represented by the proposed dataset has sufficient research value.
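The weakly supervised setting, where only a video-level label is available, can be illustrated with a generic multi-instance pooling sketch. This is not the HMIL method itself, whose details the abstract does not give: the two-level max pooling, the 0.5 threshold, and the function names are assumptions, and the segment scores are presumed to come from some upstream per-segment classifier.

```python
def bag_score(instance_scores):
    """Max pooling: a bag is as suspicious as its most suspicious instance."""
    return max(instance_scores)


def video_score(audio_segments, video_segments):
    """Pool segment scores in two levels.

    Level 1 pools the segment scores within each modality.
    Level 2 pools the two modality scores into one video-level score.
    """
    return bag_score([bag_score(audio_segments), bag_score(video_segments)])


def predict(audio_segments, video_segments, threshold=0.5):
    """Flag the whole video as forged if any segment in any modality is."""
    return video_score(audio_segments, video_segments) >= threshold
```

With this kind of pooling, a single high-scoring segment is enough to flag a video, which is why short forged spans (low forgery ratios) are harder: the model must localize a small number of positive instances from a video-level label alone.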

Key words: deepfake detection, multimodal forgery detection, partial forgery, multi-instance learning, deepfake dataset, content safety

CLC number: