Detection algorithm of audio scene sound replacement falsification based on ResNet

doi:10.11772/j.issn.1001-9081.2021061432

Journal of Computer Applications, 2022, Vol. 42, Issue 6: 1724-1728. DOI: 10.11772/j.issn.1001-9081.2021061432

National Open Distributed and Parallel Computing Conference 2021 (DPCS 2021)

Mingyu DONG¹, Diqun YAN¹,²

1. Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo Zhejiang 315211, China
2. Southeast Digital Economic Development Institute, Quzhou Zhejiang 324000, China

Received: 2021-08-10 Revised: 2021-11-10 Accepted: 2021-11-17 Online: 2022-01-10 Published: 2022-06-10
Contact: Diqun YAN
About author: DONG Mingyu, born in 1997, M. S. candidate. His research interests include machine learning, multimedia forensics, and adversarial examples.
Supported by: National Natural Science Foundation of China (U1736215); Zhejiang Provincial Natural Science Foundation (LY20F020010); Ningbo Natural Science Foundation (202003N4089)

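Author profile: DONG Mingyu (born 1997), male, from Ninghai, Zhejiang, M. S. candidate, CCF member. Research interests: machine learning, multimedia forensics, adversarial examples.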

Abstract

To detect audio scene sound replacement, a form of falsification that is cheap to produce and hard to notice, a detection algorithm based on ResNet was proposed. The algorithm first extracts the Constant Q Cepstral Coefficient (CQCC) features of the audio. A Residual Network (ResNet) then learns from the input features and, by combining its multi-layer residual blocks with feature normalization, outputs the classification result. On the TIMIT and Voicebank databases, the detection accuracy of the proposed algorithm reaches up to 100%, and its false acceptance rate is as low as 1.37%. In realistic scenes, when detecting audio recorded by three different recording devices, which carries both the devices' background noise and the sound of the original scene, the detection accuracy reaches up to 99.27%. Experimental results show that, with a suitable model, the CQCC features of audio are effective for detecting traces of scene replacement.
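The two ingredients the abstract names can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-vector normalization (a stand-in for the "feature normalization" mentioned above), the pre-activation ordering, and the use of small dense layers instead of convolutions are all assumptions made for illustration. The first function shows the geometric frequency spacing that underlies a constant-Q analysis; the rest shows the skip connection that defines a residual block.

```python
import math


def constant_q_centers(f_min, bins_per_octave, n_bins):
    """Center frequencies f_k = f_min * 2**(k / B) of a constant-Q analysis.

    Because the spacing is geometric, f_k / (f_{k+1} - f_k) is the same
    for every bin, which is what "constant Q" refers to.
    """
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def normalize(features, eps=1e-5):
    """Zero-mean, unit-variance scaling of one feature vector (assumed stand-in)."""
    mean = sum(features) / len(features)
    var = sum((v - mean) ** 2 for v in features) / len(features)
    return [(v - mean) / math.sqrt(var + eps) for v in features]


def relu(values):
    """Element-wise rectifier."""
    return [max(0.0, v) for v in values]


def linear(weights, values):
    """Dense layer without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, w1, w2):
    """Return x + F(x), where F is normalize -> ReLU -> linear, applied twice.

    The weight matrices must be square with the same size as x so that the
    skip connection can add input and output element by element.
    """
    h = linear(w1, relu(normalize(x)))
    h = linear(w2, relu(normalize(h)))
    return [a + b for a, b in zip(x, h)]


if __name__ == "__main__":
    # Hypothetical settings: 12 bins per octave starting at 100 Hz.
    print(constant_q_centers(100.0, 12, 4))
    # With all-zero weights the block reduces to the identity mapping.
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    print(residual_block([0.5, -2.0], zeros, zeros))
```

The identity behaviour with zero weights is the point of the skip connection: each block only has to learn a correction to its input, which is what lets many blocks be stacked.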

Key words: audio falsification, audio scene sound replacement, Residual Network (ResNet), Constant Q Cepstral Coefficient (CQCC)

CLC Number: TP391.4

Mingyu DONG, Diqun YAN. Detection algorithm of audio scene sound replacement falsification based on ResNet[J]. Journal of Computer Applications, 2022, 42(6): 1724-1728.

Figures/Tables 7

Model	Audio feature	Training dataset	Accuracy on Voicebank test set/%	Accuracy on TIMIT test set/%	FAR/%
SVM	MFCC	Voicebank	98.43	95.00	5.81
SVM	MFCC	TIMIT	50.00	100.00	0.00
SVM	CQCC	Voicebank	97.20	96.06	6.02
SVM	CQCC	TIMIT	100.00	54.21	0.03
VGG	MFCC	Voicebank	92.94	88.72	15.76
VGG	MFCC	TIMIT	66.38	100.00	0.37
VGG	CQCC	Voicebank	91.14	65.11	9.37
VGG	CQCC	TIMIT	62.38	100.00	0.25
ResNet	MFCC	Voicebank	90.34	83.17	19.71
ResNet	MFCC	TIMIT	50.00	100.00	0.00
ResNet	CQCC	Voicebank	86.04	89.33	15.08
ResNet	CQCC	TIMIT	94.24	100.00	1.73
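The two metrics reported in the table above can be computed as in the following sketch. The page does not define them, so the definitions here are the conventional ones and should be read as assumptions: accuracy is the share of all samples classified correctly, and the false acceptance rate (FAR) is the share of faked samples that the detector accepts as genuine. The function name and the label encoding (1 = genuine, 0 = faked) are hypothetical.

```python
def accuracy_and_far(labels, predictions):
    """Return (accuracy %, FAR %) for binary genuine/faked decisions.

    labels, predictions: sequences of 1 (genuine) or 0 (faked).
    FAR counts faked samples that were predicted as genuine.
    """
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")

    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    fake_predictions = [p for y, p in zip(labels, predictions) if y == 0]
    if not fake_predictions:
        raise ValueError("FAR is undefined without faked samples")

    false_accepts = sum(1 for p in fake_predictions if p == 1)
    accuracy = 100.0 * correct / len(labels)
    far = 100.0 * false_accepts / len(fake_predictions)
    return accuracy, far


if __name__ == "__main__":
    # Toy data: four genuine and four faked samples, one error of each kind.
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    predictions = [1, 1, 1, 0, 0, 0, 0, 1]
    print(accuracy_and_far(labels, predictions))
```

Reporting both numbers matters because a detector that labels everything as faked scores 50% accuracy on a balanced test set with a FAR of 0, so a low FAR alone says little.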

Device	MFCC, Voicebank	MFCC, TIMIT	CQCC, Voicebank	CQCC, TIMIT
Letv	56.33	50.67	73.89	63.67
OPPO	55.78	57.44	65.00	61.89
iPhone	79.93	50.00	99.27	89.05
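As a worked reading of the device table above, the sketch below computes how much each device's score changes when CQCC replaces MFCC. The numbers are copied from the table; treating them as accuracy percentages is an assumption based on the 99.27% figure quoted in the abstract, and the variable and function names are hypothetical.

```python
# Scores per device: (MFCC Voicebank, MFCC TIMIT, CQCC Voicebank, CQCC TIMIT).
DEVICE_SCORES = {
    "Letv": (56.33, 50.67, 73.89, 63.67),
    "OPPO": (55.78, 57.44, 65.00, 61.89),
    "iPhone": (79.93, 50.00, 99.27, 89.05),
}


def cqcc_gain(scores):
    """Return {device: (gain on Voicebank column, gain on TIMIT column)}.

    A gain is the CQCC score minus the MFCC score, rounded to two decimals
    to match the precision of the table.
    """
    gains = {}
    for device, (mfcc_vb, mfcc_timit, cqcc_vb, cqcc_timit) in scores.items():
        gains[device] = (round(cqcc_vb - mfcc_vb, 2), round(cqcc_timit - mfcc_timit, 2))
    return gains


if __name__ == "__main__":
    for device, (gain_vb, gain_timit) in cqcc_gain(DEVICE_SCORES).items():
        print(f"{device}: Voicebank {gain_vb:+.2f}, TIMIT {gain_timit:+.2f}")
```

Every gain is positive, which is consistent with the abstract's conclusion that CQCC features carry the scene replacement trace better than MFCC on real recording devices.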
