Dual branch synthetic speech detection based on attention and squeeze-excitation inception

doi:10.11772/j.issn.1001-9081.2023101458

Journal of Computer Applications, 2024, Vol. 44, Issue (10): 3217-3222. DOI: 10.11772/j.issn.1001-9081.2023101458

• Multimedia computing and computer simulation •

Han WANG, Lasheng ZHAO, Qiang ZHANG, Yinqing CHENG, Zepeng QIU

Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education (Dalian University), Dalian, Liaoning 116622, China

Received: 2023-10-27 Revised: 2024-02-22 Accepted: 2024-02-26 Online: 2024-10-15 Published: 2024-10-10
Contact: Lasheng ZHAO (goodzls@126.com)
About the authors:
WANG Han, born in 1998, from Tieling, Liaoning, M.S. candidate. Her research interests include deep learning and spoofed speech detection.
ZHAO Lasheng, born in 1978, from Shuozhou, Shanxi, Ph.D., lecturer. His research interests include deep learning and speech signal processing.
ZHANG Qiang, born in 1971, from Xi'an, Shaanxi, Ph.D., professor. His research interests include biocomputing and artificial intelligence, and big data analysis and processing.
CHENG Yinqing, born in 1999, from Xianning, Hubei, M.S. candidate. Her research interests include deep learning and spoofed speech detection.
QIU Zepeng, born in 1998, from Weifang, Shandong, M.S. candidate. His research interests include deep learning and speech keyword detection.
Supported by:
Basic Scientific Research Project of Educational Department of Liaoning Province (LJKMZ20221838)

Abstract:

Synthetic speech attacks pose a significant threat to people's lives. To address two shortcomings of existing models, namely their limited ability to extract key information from redundant data and the inability of a single model to combine the advantages of multiple detection models, a synthetic speech detection model based on a Dual branch with an Attention Branch and a Squeeze-Excitation Inception (SE-Inc) Branch (Dual-ABIB) was proposed. Firstly, the initial feature maps extracted by a Sinc-based Convolutional Neural Network (SincNet) were used to train the attention branch of the synthetic speech detection model, which output the attention maps. Secondly, the attention maps were multiplied with the initial feature maps and then superposed on them, and the result was used as the input for training the SE-Inc branch. Finally, the classification scores obtained by the two branches were combined through decision-level weighted fusion to achieve synthetic speech detection. Experimental results show that, with 539×10³ parameters, the proposed model achieves a minimum tandem Detection Cost Function (min t-DCF) of 0.0332 and an Equal Error Rate (EER) of 1.15% on the ASVspoof2019 dataset. Compared with SE-ResABNet (Squeeze-Excitation ResNet Attention Branch Network), the proposed model has only 56% as many parameters while reducing min t-DCF and EER by 34.5% and 39.2%, respectively. The proposed model also shows better generalization ability on the ASVspoof2015 and ASVspoof2021 datasets. These results verify that Dual-ABIB can obtain lower min t-DCF and EER with fewer parameters.

Key words: attention mechanism, Squeeze-Excitation (SE) module, dual branch, synthetic speech detection, decision-level fusion
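
The second step of the abstract (multiplying the attention maps with the initial feature maps and then superposing the result on them) can be illustrated with a minimal NumPy sketch. The function name `apply_attention`, the array shapes, and the toy values are assumptions made for illustration only; the abstract does not specify tensor layouts, and this is not the authors' implementation.

```python
import numpy as np

def apply_attention(features: np.ndarray, attention: np.ndarray) -> np.ndarray:
    """Multiply the feature maps by the attention map, then add the features back.

    Equivalent to features * (1 + attention), so positions with zero
    attention keep their original values instead of being erased.
    """
    return features * attention + features

# Hypothetical layout: (channels, time). The attention map holds one weight
# per time step and is broadcast across both channels.
features = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])
attention = np.array([[0.0, 0.5, 1.0]])

refined = apply_attention(features, attention)
print(refined)
```

Adding the original features back means the second branch still sees the full input even where the attention weight is close to zero, which is one plausible reading of "superposed" in the abstract.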

CLC Number: TN912.3

Han WANG, Lasheng ZHAO, Qiang ZHANG, Yinqing CHENG, Zepeng QIU. Dual branch synthetic speech detection based on attention and squeeze-excitation inception[J]. Journal of Computer Applications, 2024, 44(10): 3217-3222.

王晗, 赵腊生, 张强, 程银清, 邱泽鹏. 基于注意力和挤压‒激励Inception的双分支合成语音检测[J]. 计算机应用, 2024, 44(10): 3217-3222.

References

1	SNYDER D, GARCIA-ROMERO D, SELL G, et al. X-vectors: robust DNN embeddings for speaker recognition[C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 5329-5333.
2	KAUR N, SINGH P. Conventional and contemporary approaches used in text to speech synthesis: a review[J]. Artificial Intelligence Review, 2022, 56(7): 5837-5880.
3	SISMAN B, YAMAGISHI J, KING S, et al. An overview of voice conversion and its challenges: from statistical modeling to deep learning[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 132-157.
4	HU C, ZHOU R, YUAN Q. Replay speech detection based on dual-input hierarchical fusion network[J]. Applied Sciences, 2023, 13(9): No.5350.
5	任延珍，刘晨雨，刘武洋，等. 语音伪造及检测技术研究综述［J］. 信号处理， 2021， 37（12）： 2412-2439.
	REN Y Z, LIU C Y, LIU W Y, et al. A survey on speech forgery and detection[J]. Journal of Signal Processing, 2021, 37(12): 2412-2439.
6	WEI L, LONG Y, WEI H, et al. New acoustic features for synthetic and replay spoofing attack detection[J]. Symmetry, 2022, 14(2): No.274.
7	TODISCO M, DELGADO H, EVANS N. Constant Q cepstral coefficients: a spoofing countermeasure for automatic speaker verification[J]. Computer Speech and Language, 2017, 45: 516-535.
8	PATEL T B, PATIL H A. Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech[C]// Proceedings of the INTERSPEECH 2015. [S.l.]: International Speech Communication Association, 2015: 2062-2066.
9	CUI S, HUANG B, HUANG J, et al. Synthetic speech detection based on local autoregression and variance statistics[J]. IEEE Signal Processing Letters, 2022, 29: 1462-1466.
10	RAVANELLI M, BENGIO Y. Speaker recognition from raw waveform with SincNet[C]// Proceedings of the 2018 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2018: 1021-1028.
11	DINKEL H, CHEN N, QIAN Y, et al. End-to-end spoofing detection with raw waveform CLDNNs[C]// Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2017: 4860-4864.
12	MA Y, REN Z, XU S. RW-ResNet: a novel speech anti-spoofing model using raw waveform[C]// Proceedings of the INTERSPEECH 2021. [S.l.]: International Speech Communication Association, 2021: 4144-4148.
13	TAK H, PATINO J, TODISCO M, et al. End-to-end anti-spoofing with RawNet2[C]// Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2021: 6369-6373.
14	JUNG J W, HEO H S, TAK H, et al. AASIST: audio anti-spoofing using integrated spectro-temporal graph attention networks[C]// Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2022: 6367-6371.
15	ALZANTOT M, WANG Z, SRIVASTAVA M B. Deep residual neural networks for audio spoofing detection[C]// Proceedings of the INTERSPEECH 2019. [S.l.]: International Speech Communication Association, 2019: 1078-1082.
16	WU Z, DAS R K, YANG J, et al. Light convolutional neural network with feature genuinization for detection of synthetic speech attacks[C]// Proceedings of the INTERSPEECH 2020. [S.l.]: International Speech Communication Association, 2020: 1101-1105.
17	LUO A, LI E, LIU Y, et al. A capsule network based approach for detection of audio spoofing attacks[C]// Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2021: 6359-6363.
18	HUA G, TEOH A B J, ZHANG H. Towards end-to-end synthetic speech detection[J]. IEEE Signal Processing Letters, 2021, 28: 1265-1269.
19	SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 1-9.
20	LIU X, LIU M, WANG L, et al. Leveraging positional-related local-global dependency for synthetic speech detection[C]// Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2023: 1-5.
21	FUKUI H, HIRAKAWA T, YAMASHITA T, et al. Attention branch network: learning of attention mechanism for visual explanation[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 10697-10706.
22	张秋余，王煜坤. 基于改进Inception网络的语音分类模型［J］. 计算机应用， 2023， 43（3）： 909-915.
	ZHANG Q Y, WANG Y K. Speech classification model based on improved Inception network[J]. Journal of Computer Applications, 2023, 43(3): 909-915.
23	HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7132-7141.
24	WANG X, YAMAGISHI J, TODISCO M, et al. ASVspoof 2019: a large-scale public database of synthesized, converted and replayed speech[J]. Computer Speech and Language, 2020, 64: No.101114.
25	WU Z, KINNUNEN T, EVANS N, et al. ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge[C]// Proceedings of the INTERSPEECH 2015. [S.l.]: International Speech Communication Association, 2015: 2037-2041.
26	LIU X, WANG X, SAHIDULLAH M, et al. ASVspoof 2021: towards spoofed and deepfake speech detection in the wild[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 2507-2522.
27	BRÜMMER N, DE VILLIERS E. The BOSARIS Toolkit user guide: theory, algorithms and code for binary classifier score processing[EB/OL]. (2013-04-10) [2023-08-10].
28	KINNUNEN T, DELGADO H, EVANS N, et al. Tandem assessment of spoofing countermeasures and automatic speaker verification: fundamentals[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 2195-2210.
29	WANG X, YAMAGISHI J. A comparative study on recent neural spoofing countermeasures for synthetic speech detection[C]// Proceedings of the INTERSPEECH 2021. [S.l.]: International Speech Communication Association, 2021: 4259-4263.
30	ROSTAMI A M, HOMAYOUNPOUR M M, NICKABADI A. Efficient attention branch network with combined loss function for automatic speaker verification spoof detection[J]. Circuits, Systems, and Signal Processing, 2023, 42(7): 4252-4270.
31	GE W, PATINO J, TODISCO M, et al. Raw differentiable architecture search for speech deepfake and spoofing detection[C]// Proceedings of the 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. [S.l.]: International Speech Communication Association, 2021: 22-28.
32	GONG J, CHEN N. Synthetic voice spoofing detection based on feature pyramid conformer[C]// Proceedings of the INTERSPEECH 2023. [S.l.]: International Speech Communication Association, 2023: 2803-2807.
33	WANG X, YAMAGISHI J. Investigating self-supervised front ends for speech spoofing countermeasures[C]// Proceedings of the 2021 Speaker and Language Recognition Workshop. [S.l.]: International Speech Communication Association, 2022: 100-106.

Subset	Bona fide speech samples	Synthetic speech samples	Spoofing types
Training set	2,580	22,800	A01-A06
Development set	2,548	22,296	A01-A06
Evaluation set	7,355	63,882	A07-A19

Model	EER/%
Inc-TSSDNet [18]	4.04
Sinc-Attention	3.89
Sinc-Inception	3.49

Tested model	EER/%
AB branch	1.81
IB branch	1.63
Weighted fusion	1.15
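
The table above reports the EER of each branch alone and of their weighted fusion. As a rough illustration of how such a number can be obtained, the sketch below fuses two sets of branch scores with a fixed weight and estimates the EER by sweeping a decision threshold. The weight `weight_ab=0.5`, the function names, and the toy scores are assumptions for illustration; the page does not state the fusion weights, and the authors' evaluation code may differ.

```python
import numpy as np

def fuse_scores(score_ab, score_ib, weight_ab=0.5):
    """Decision-level weighted fusion of two branch scores.

    weight_ab is a hypothetical weight, not a value taken from the paper.
    """
    score_ab = np.asarray(score_ab, dtype=float)
    score_ib = np.asarray(score_ib, dtype=float)
    return weight_ab * score_ab + (1.0 - weight_ab) * score_ib

def equal_error_rate(bona_fide_scores, spoof_scores):
    """Approximate the EER by testing every observed score as a threshold.

    Higher scores are assumed to mean "more likely bona fide".
    """
    bona = np.asarray(bona_fide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bona, spoof]))
    # Miss rate: bona fide utterances scoring below the threshold.
    miss = np.array([(bona < t).mean() for t in thresholds])
    # False-alarm rate: spoofed utterances scoring at or above the threshold.
    false_alarm = np.array([(spoof >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(false_alarm - miss))
    return (false_alarm[idx] + miss[idx]) / 2.0

# Toy scores for four bona fide and four spoofed utterances per branch.
ab_bona, ib_bona = [0.9, 0.6, 0.7, 0.2], [0.9, 1.0, 0.7, 0.6]
ab_spoof, ib_spoof = [0.1, 0.2, 0.3, 0.8], [0.1, 0.2, 0.3, 0.4]

fused_bona = fuse_scores(ab_bona, ib_bona)
fused_spoof = fuse_scores(ab_spoof, ib_spoof)
eer = equal_error_rate(fused_bona, fused_spoof)
print(f"EER of fused toy scores: {eer:.2%}")
```

Official ASVspoof evaluations interpolate the detection error trade-off curve, so this threshold sweep only approximates the reported metric.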

Att module	SE module	EER/%	min t-DCF	Computational efficiency/ms
×	×	3.49	0.0869	5.67
×	√	2.13	0.0662	7.17
√	×	1.33	0.0372	6.28
√	√	1.15	0.0332	9.97
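
The SE column in the ablation above refers to the squeeze-and-excitation module of reference 23. The sketch below shows the generic mechanism on a (channels, time) feature map: global average pooling, a two-layer bottleneck, and sigmoid gates that rescale each channel. The shapes, the reduction ratio, and the random weights are assumptions for illustration; how the module is wired into the paper's Inception branch is not described on this page.

```python
import numpy as np

def squeeze_excitation(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Generic squeeze-and-excitation over a (channels, time) feature map.

    w1 has shape (channels // r, channels) and w2 has shape
    (channels, channels // r), where r is the reduction ratio.
    """
    squeezed = x.mean(axis=1)                     # squeeze: one statistic per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)       # excitation: bottleneck layer + ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid gates in (0, 1)
    return x * gates[:, None]                     # rescale every channel by its gate

# Hypothetical sizes and randomly initialised weights (no training involved).
rng = np.random.default_rng(0)
channels, time_steps, reduction = 8, 16, 4
x = rng.standard_normal((channels, time_steps))
w1 = rng.standard_normal((channels // reduction, channels))
w2 = rng.standard_normal((channels, channels // reduction))

y = squeeze_excitation(x, w1, w2)
print(y.shape)
```

Because each gate lies between 0 and 1, the module can only attenuate channels relative to one another; in a trained network the weights `w1` and `w2` are learned so that informative channels keep gates near 1.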

Model	Front-end feature	EER/% (evaluation set)	min t-DCF (evaluation set)	Parameters/10³
RawNet2 [13]	Waveform	4.66	0.1294	25,430
Inc-TSSDNet [18]	Waveform	4.04	0.0976	92
CapsNet [17]	LFCC	1.97	0.0538	-
LCNN-LSTM [29]	LFCC	1.92	0.0524	-
SE-ResABNet [30]	LFCC	1.89	0.0507	964
Raw PC-DARTS [31]	Waveform	1.77	0.0517	24,480
FPM+EM-Softmax [32]	LFCC	1.65	0.0470	-
LGF [33]	Wav2Vec2.0	1.28	0.1000	-
Dual-ABIB	Waveform	1.15	0.0332	539
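
The relative improvements quoted in the abstract can be checked against the SE-ResABNet and Dual-ABIB rows of this table. The short calculation below uses only the table values; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return (baseline - proposed) / baseline * 100.0

# Values taken from the SE-ResABNet and Dual-ABIB rows of the table.
eer_reduction = relative_reduction(1.89, 1.15)        # EER/%
tdcf_reduction = relative_reduction(0.0507, 0.0332)   # min t-DCF
param_ratio = 539 / 964 * 100.0                       # parameter count ratio in percent

print(f"EER reduction: {eer_reduction:.1f}%")
print(f"min t-DCF reduction: {tdcf_reduction:.1f}%")
print(f"Parameter ratio: {param_ratio:.0f}%")
```

The three results round to 39.2%, 34.5%, and 56%, which agrees with the figures stated in the abstract.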

Model	19LA EER/%	15LA EER/%	21LA EER/%
LFCC-GMM [24]	9.57	14.87	19.30
CQCC-GMM [24]	8.09	36.31	15.62
Inc-TSSDNet [18]	4.04	3.29	17.56
AASIST [14]	1.13	3.22	10.90
Dual-ABIB	1.15	2.29	10.43
