Detection algorithm of audio scene sound replacement falsification based on ResNet

doi:10.11772/j.issn.1001-9081.2021061432

Journal of Computer Applications, 2022, Vol. 42, Issue (6): 1724-1728. DOI: 10.11772/j.issn.1001-9081.2021061432

• Paper of the 2021 National Annual Conference on Open Distributed and Parallel Computing (DPCS 2021)

Mingyu DONG¹, Diqun YAN¹,²

1. Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo Zhejiang 315211, China
2. Southeast Digital Economic Development Institute, Quzhou Zhejiang 324000, China

Received: 2021-08-10 Revised: 2021-11-10 Accepted: 2021-11-17 Online: 2022-01-10 Published: 2022-06-10
Contact: Diqun YAN
About author: DONG Mingyu, born in 1997 in Ninghai, Zhejiang, M. S. candidate, CCF member. His research interests include machine learning, multimedia forensics, and adversarial examples.
Supported by:
National Natural Science Foundation of China (U1736215); Zhejiang Provincial Natural Science Foundation (LY20F020010); Ningbo Natural Science Foundation (202003N4089)

Abstract:

Audio scene sound replacement is a falsification that is cheap to perform and hard to perceive. To detect samples falsified in this way, a detection algorithm based on ResNet was proposed. The algorithm first extracts the Constant Q Cepstral Coefficient (CQCC) features of the audio. A Residual Network (ResNet) then learns from the input features and, by combining its multi-layer residual blocks with feature normalization, outputs the classification result. On the TIMIT and Voicebank databases, the detection accuracy of the proposed algorithm reaches up to 100%, and its false acceptance rate is as low as 1.37%. In realistic scenes, when detecting audio recorded by three different recording devices, which carries both the background noise of the device and the sound of the original scene, the detection accuracy reaches up to 99.27%. Experimental results show that, with a suitable model, the CQCC features of audio are effective for detecting the traces of scene replacement.

Key words: audio falsification, audio scene sound replacement, Residual Network (ResNet), Constant Q Cepstral Coefficient (CQCC)
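
The abstract describes a ResNet whose residual blocks are combined with feature normalization. The following is a minimal sketch of that idea only. The layer sizes, the fully connected branch, the ordering normalize, ReLU, dense, and the names `normalize`, `dense`, and `residual_block` are illustrative assumptions, not the architecture used in the paper. It shows how a residual block adds its input back to the output of a small learned branch:

```python
import math

def normalize(x, eps=1e-5):
    """Scale a feature vector to zero mean and unit variance (assumed form of normalization)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]

def dense(x, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def residual_block(x, w1, b1, w2, b2):
    """Return F(x) + x, where F is normalize -> ReLU -> dense -> ReLU -> dense."""
    h = dense(relu(normalize(x)), w1, b1)
    h = dense(relu(h), w2, b2)
    # Skip connection: the input is added back to the branch output.
    return [hi + xi for hi, xi in zip(h, x)]

# Hypothetical 3-dimensional feature vector and all-zero branch weights.
x = [0.5, -1.0, 2.0]
zeros = [[0.0] * 3 for _ in range(3)]
zero_bias = [0.0] * 3
out = residual_block(x, zeros, zero_bias, zeros, zero_bias)
print(out)
```

With all branch weights at zero the block reduces to the identity mapping, which is one commonly cited reason why deep residual networks are easier to optimize than plain stacks of layers.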

CLC Number:

TP391.4

Mingyu DONG, Diqun YAN. Detection algorithm of audio scene sound replacement falsification based on ResNet[J]. Journal of Computer Applications, 2022, 42(6): 1724-1728.

Figures and Tables (7)

Fig. 1 Spectrograms of positive and negative samples

Fig. 2 Flowchart of CQCC feature extraction

Fig. 3 Frequency distribution of positive and negative samples

Fig. 4 Schematic diagram of ResNet structure

Fig. 5 Convergence analysis of two networks
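
Fig. 2 refers to the CQCC extraction flow, which is not reproduced on this page. The sketch below assumes the commonly described CQCC pipeline (constant-Q transform, log power spectrum, uniform resampling, discrete cosine transform) and illustrates only two of its steps: the geometric spacing of the constant-Q bins and the final cosine transform. The uniform resampling step is omitted, and the values of `f_min`, `n_bins`, `bins_per_octave`, and the toy spectrum are illustrative assumptions, not settings from the paper.

```python
import math

def cqt_center_frequencies(f_min, n_bins, bins_per_octave):
    """Geometrically spaced center frequencies: f_k = f_min * 2**(k / B)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]

def q_factor(bins_per_octave):
    """Ratio of center frequency to bin spacing; it is the same for every bin."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II, turning a log power spectrum into cepstral coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * p * (i + 0.5) / n) for i, v in enumerate(values))
        for p in range(n_coeffs)
    ]

# Two octaves above a hypothetical 32 Hz lower bound, 12 bins per octave.
freqs = cqt_center_frequencies(f_min=32.0, n_bins=25, bins_per_octave=12)

# Toy log power spectrum standing in for the output of a real constant-Q transform.
toy_log_power = [math.log(1.0 + k) for k in range(len(freqs))]
cepstrum = dct_ii(toy_log_power, n_coeffs=5)

print(round(freqs[12], 3), round(q_factor(12), 3), len(cepstrum))
```

Because the bins are spaced geometrically, every octave contains the same number of bins, so low frequencies get finer frequency resolution than they would in a fixed-resolution short-time Fourier transform.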

Tab. 1 Accuracy and false acceptance rate (FAR) for different models under different conditions

Model	Audio feature	Training dataset	Accuracy on Voicebank test set/%	Accuracy on TIMIT test set/%	FAR/%
SVM	MFCC	Voicebank	98.43	95.00	5.81
SVM	MFCC	TIMIT	50.00	100.00	0.00
SVM	CQCC	Voicebank	97.20	96.06	6.02
SVM	CQCC	TIMIT	100.00	54.21	0.03
VGG	MFCC	Voicebank	92.94	88.72	15.76
VGG	MFCC	TIMIT	66.38	100.00	0.37
VGG	CQCC	Voicebank	91.14	65.11	9.37
VGG	CQCC	TIMIT	62.38	100.00	0.25
ResNet	MFCC	Voicebank	90.34	83.17	19.71
ResNet	MFCC	TIMIT	50.00	100.00	0.00
ResNet	CQCC	Voicebank	86.04	89.33	15.08
ResNet	CQCC	TIMIT	94.24	100.00	1.73
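
Tab. 1 reports accuracy and FAR without defining them on this page. The sketch below assumes the usual convention: accuracy is the percentage of correctly classified samples, and FAR is the percentage of falsified samples that the detector accepts as genuine. The label encoding (1 for genuine, 0 for falsified) and the sample lists are hypothetical and are not data from the paper.

```python
def accuracy(y_true, y_pred):
    """Percentage of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)

def false_acceptance_rate(y_true, y_pred, genuine=1, falsified=0):
    """Percentage of falsified samples that were predicted as genuine."""
    falsified_preds = [p for t, p in zip(y_true, y_pred) if t == falsified]
    if not falsified_preds:
        raise ValueError("FAR is undefined without falsified samples")
    accepted = sum(1 for p in falsified_preds if p == genuine)
    return 100.0 * accepted / len(falsified_preds)

# Hypothetical labels: four genuine samples followed by four falsified ones.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

acc = accuracy(y_true, y_pred)
far = false_acceptance_rate(y_true, y_pred)
print(acc, far)
```

Under this convention a detector that labels every sample as falsified has an FAR of 0% while its accuracy on a balanced test set is only 50%, so the two columns are best read together.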

Tab. 2 Accuracy for audios recorded by different devices (unit: %)

Device	MFCC, Voicebank	MFCC, TIMIT	CQCC, Voicebank	CQCC, TIMIT
Letv	56.33	50.67	73.89	63.67
OPPO	55.78	57.44	65.00	61.89
iPhone	79.93	50.00	99.27	89.05
