Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (6): 1724-1728. DOI: 10.11772/j.issn.1001-9081.2021061432

• Papers from the 2021 National Conference on Open Distributed and Parallel Computing (DPCS 2021)

Detection algorithm for audio scene sound replacement falsification based on ResNet

Mingyu DONG1, Diqun YAN1,2

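  1. Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, Zhejiang 315211
  2. Southeast Digital Economic Development Institute, Quzhou, Zhejiang 324000
  • Received: 2021-08-10 Revised: 2021-11-10 Accepted: 2021-11-17 Online: 2022-01-10 Published: 2022-06-10
  • Corresponding author: Diqun YAN
  • About the author: DONG Mingyu (born 1997), male, from Ninghai, Zhejiang, M. S. candidate, CCF member. His main research interests include machine learning, multimedia forensics, and adversarial examples.
  • Supported by:
    National Natural Science Foundation of China (U1736215); Zhejiang Provincial Natural Science Foundation (LY20F020010); Ningbo Natural Science Foundation (202003N4089)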

Detection algorithm of audio scene sound replacement falsification based on ResNet

Mingyu DONG1, Diqun YAN1,2

  1. Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, Zhejiang 315211, China
  2. Southeast Digital Economic Development Institute, Quzhou, Zhejiang 324000, China
  • Received: 2021-08-10 Revised: 2021-11-10 Accepted: 2021-11-17 Online: 2022-01-10 Published: 2022-06-10
  • Contact: Diqun YAN
  • About the author: DONG Mingyu, born in 1997, M. S. candidate. His research interests include machine learning, multimedia forensics, and adversarial examples.
  • Supported by:
    National Natural Science Foundation of China (U1736215); Zhejiang Provincial Natural Science Foundation (LY20F020010); Ningbo Natural Science Foundation (202003N4089)

Abstract:

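To address the detection of forged samples produced by audio scene sound replacement, a forgery that is cheap to make and hard to notice, a ResNet-based forged-sample detection algorithm is proposed. The algorithm first extracts the Constant Q Cepstral Coefficient (CQCC) features of the audio. A Residual Network (ResNet) structure then learns from the input features, combining the network's multi-layer residual blocks with feature normalization, and finally outputs the classification result. On the TIMIT and Voicebank databases, the proposed algorithm achieves a detection accuracy of up to 100% and a false acceptance rate as low as 1.37%. In real-world scenes, when detecting audio that was recorded by several different recording devices and carries the devices' noise floor as well as the original scene sound, the algorithm achieves a detection accuracy of up to 99.27%. The experimental results show that, with a suitable model, using the CQCC features of audio to detect traces of scene replacement is effective.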
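The abstract names residual blocks and feature normalization but gives no layer details, so the following is only a minimal sketch of the general residual idea, y = ReLU(x + F(x)), in plain Python. The dense layers, the per-vector normalization, and every name here (`normalize`, `linear`, `residual_block`, `w1`, `w2`) are illustrative assumptions, not the authors' architecture, which presumably uses convolutional layers over CQCC feature maps.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def normalize(x: Vector, eps: float = 1e-5) -> Vector:
    """Zero-mean, unit-variance normalization of one feature vector (assumed form)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(x: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def linear(x: Vector, weights: Matrix) -> Vector:
    """Matrix-vector product; each row of `weights` yields one output value."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Compute ReLU(x + F(x)), where F is linear -> normalize -> ReLU -> linear -> normalize.

    Both weight matrices must be square with side len(x) so that the
    skip connection can add x to F(x) element by element.
    """
    h = relu(normalize(linear(x, w1)))
    h = normalize(linear(h, w2))
    return relu([a + b for a, b in zip(x, h)])


# Hypothetical usage: with all-zero weights F(x) is zero, so the block
# reduces to ReLU(x); the skip connection passes the input straight through.
zeros = [[0.0] * 3 for _ in range(3)]
print(residual_block([1.0, -2.0, 3.0], zeros, zeros))  # [1.0, 0.0, 3.0]
```

The zero-weight case illustrates why skip connections help deep networks: a block that has learned nothing still forwards its input rather than destroying it.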

Key words: audio falsification, audio scene sound replacement, residual network, Constant Q Cepstral Coefficient (CQCC)

Abstract:

A ResNet-based detection algorithm was proposed for faked audio samples produced by audio scene sound replacement, a falsification that is cheap to perform and hard to perceive. Firstly, the Constant Q Cepstral Coefficient (CQCC) features of the audio were extracted. Then, the input features were learned by a Residual Network (ResNet) structure that combines multi-layer residual blocks with feature normalization, and the classification result was output. On the TIMIT and Voicebank databases, the detection accuracy of the proposed algorithm reached up to 100%, and its false acceptance rate was as low as 1.37%. In realistic scenes, when detecting audio recorded by three different recording devices, which carries the devices' background noise and the sound of the original scene, the detection accuracy of the algorithm reached up to 99.27%. Experimental results show that, with a suitable model, the CQCC features of audio are effective for detecting traces of scene replacement.
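The abstract reports two metrics, detection accuracy and false acceptance rate, without defining them. The sketch below shows one common way to compute them, assuming that a false acceptance means a faked sample classified as genuine. The label encoding (`GENUINE`, `FORGED`) and the function name `accuracy_and_far` are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple

# Hypothetical label encoding; the paper's own convention is not given here.
GENUINE, FORGED = 0, 1


def accuracy_and_far(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (detection accuracy, false acceptance rate).

    Accuracy is the fraction of samples classified correctly. The false
    acceptance rate is taken here as the fraction of forged samples that
    the detector accepts as genuine (an assumed definition).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    forged_total = sum(1 for t, _ in pairs if t == FORGED)
    false_accepts = sum(1 for t, p in pairs if t == FORGED and p == GENUINE)

    accuracy = correct / len(pairs)
    # With no forged samples the rate is undefined; report 0.0 by convention.
    far = false_accepts / forged_total if forged_total else 0.0
    return accuracy, far


# Two genuine and two forged samples; one forged sample slips through.
print(accuracy_and_far([0, 0, 1, 1], [0, 0, 1, 0]))  # (0.75, 0.5)
```

Under this definition the two figures can move independently: a detector can be highly accurate on a set dominated by genuine audio while still accepting a large share of the few forged samples.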

Key words: audio falsification, audio scene sound replacement, Residual Network (ResNet), Constant Q Cepstral Coefficient (CQCC)

CLC Number: