基于注意力和挤压‒激励Inception的双分支合成语音检测

doi:10.11772/j.issn.1001-9081.2023101458

《计算机应用》唯一官方网站 ›› 2024, Vol. 44 ›› Issue (10): 3217-3222.DOI: 10.11772/j.issn.1001-9081.2023101458

• 多媒体计算与计算机仿真 • 上一篇下一篇

基于注意力和挤压‒激励Inception的双分支合成语音检测

王晗, 赵腊生(), 张强, 程银清, 邱泽鹏

先进设计与智能计算省部共建教育部重点实验室（大连大学），辽宁大连 116622

收稿日期:2023-10-27 修回日期:2024-02-22 接受日期:2024-02-26 发布日期:2024-10-15 出版日期:2024-10-10
通讯作者: 赵腊生
作者简介:王晗（1998—），女，辽宁铁岭人，硕士研究生，主要研究方向：深度学习、语音鉴伪
赵腊生（1978—），男，山西朔州人，讲师，博士，主要研究方向：深度学习、语音信号处理 goodzls@126.com
张强（1971—），男，陕西西安人，教授，博士，主要研究方向：生物计算与人工智能、大数据分析与处理
程银清（1999—），女，湖北咸宁人，硕士研究生，主要研究方向：深度学习、语音鉴伪
邱泽鹏（1998—），男，山东潍坊人，硕士研究生，主要研究方向：深度学习、语音关键词识别。
基金资助:
辽宁省教育厅基本科研项目(LJKMZ20221838)

Dual branch synthetic speech detection based on attention and squeeze-excitation inception

Han WANG, Lasheng ZHAO(), Qiang ZHANG, Yinqing CHENG, Zepeng QIU

Key Laboratory of Advanced Design and Intelligent Computing，Ministry of Education （Dalian University），Dalian Liaoning 116622，China

Received:2023-10-27 Revised:2024-02-22 Accepted:2024-02-26 Online:2024-10-15 Published:2024-10-10
Contact: Lasheng ZHAO
About author:WANG Han， born in 1998， M. S. candidate. Her research interests include deep learning， spoof speech detection.
ZHANG Qiang， born in 1971， Ph. D.， professor. His research interests include biocomputing and artificial intelligence， big data analysis and processing.
CHENG Yinqing， born in 1999， M. S. candidate. Her research interests include deep learning， spoof speech detection.
QIU Zepeng， born in 1998， M. S. candidate. His research interests include deep learning， speech keyword detection.
Supported by:
Basic Scientific Research Project of Educational Department of Liaoning Province(LJKMZ20221838)

摘要/Abstract

摘要：

合成语音攻击给人们的生活带来巨大的威胁。为了解决现有模型从冗余信息中提取关键信息能力不足和单一模型无法综合利用多检测模型优势的问题，提出一种基于注意力和挤压-激励（SE）模块Inception （SE-Inc）的双分支（Dual-ABIB）合成语音检测模型。首先，基于SincNet（Sinc-based convolutional neural Network）提取的初始特征图训练注意力分支合成语音检测模型，并输出注意力图；其次，将注意力图和初始特征图相乘后再叠加，并将结果作为SE-Inc分支的输入进行训练；最后，通过决策级加权融合处理2个分支获得的分类分数，从而实现合成语音检测。实验结果表明，所提模型在参数量为539×10³的情况下，在ASVspoof2019数据集上获得了0.033 2的最小串联检测代价函数（min t-DCF）和1.15%的等错误率（EER）；与SE-ResABNet （Squeeze-Excitation ResNet Attention Branch Network）相比，所提模型在参数量仅为它的56%的情况下，min t-DCF和EER分别下降了34.5%和39.2%；同时，在ASVspoof2015和ASVspoof2021数据集上所提模型表现了更好的泛化能力。以上结果验证了所提模型能够在参数量较小的情况下，获得更低的min t-DCF和EER。

关键词: 注意力机制, 挤压-激励模块, 双分支, 合成语音检测, 决策级融合

Abstract:

Synthetic speech attacks can pose a significant threat to people’s lives. To address the issues of the existing models’ lack of the ability to extract key information from redundant data and the limitations of a single model in using the advantages of multiple detection models， a synthetic speech detection model based on Dual branch with Attention Branch and Squeeze-Excitation Inception （SE-Inc） Branch （Dual-ABIB） was proposed. Firstly， the initial feature maps extracted by Sinc-based Convolutional Neural Network （SincNet） were utilized to train the attention branch of the synthetic speech detection model， and the attention maps were output. Secondly， the attention maps were multiplied and superposed with the original feature maps， and the result was trained as the input for the SE-Inc branch. Finally， classification scores obtained by the two branches were processed through decision-level weighted fusion to achieve synthetic speech detection. Experimental results show that the proposed model achieves a minimum tandem Detection Cost Function （min t-DCF） of 0.033 2 and an Equal Error Rate （EER） of 1.15% on ASVspoof2019 dataset when the number of parameters is 539×10³. Compared with SE-ResABNet （Squeeze-Excitation ResNet Attention Branch Network）， when the number of parameters of the proposed model is only 56% of that of SE-ResABNet， the proposed model has the min t-DCF and EER reduced by 34.5% and 39.2% respectively. At the same time， the proposed model shows better generalization ability on ASVspoof2015 and ASVspoof2021 datasets. The above results verify that Dual-ABIB can obtain lower min t-DCF and EER with less of parameters.

Key words: attention mechanism, Squeeze-Excitation (SE) module, dual branch, synthetic speech detection, decision-level fusion

中图分类号:

TN912.3

王晗, 赵腊生, 张强, 程银清, 邱泽鹏. 基于注意力和挤压‒激励Inception的双分支合成语音检测[J]. 计算机应用, 2024, 44(10): 3217-3222.

Han WANG, Lasheng ZHAO, Qiang ZHANG, Yinqing CHENG, Zepeng QIU. Dual branch synthetic speech detection based on attention and squeeze-excitation inception[J]. Journal of Computer Applications, 2024, 44(10): 3217-3222.

图/表 9

参考文献 33

1	SNYDER D， GARCIA-ROMERO D， SELL G， et al. X-vectors： robust DNN embeddings for speaker recognition［C］// Proceedings of the 2018 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2018： 5329-5333.
2	KAUR N， SINGH P. Conventional and contemporary approaches used in text to speech synthesis： a review［J］. Artificial Intelligence Review， 2022， 56（7）： 5837-5880.
3	SISMAN B， YAMAGISHI J， KING S， et al. An overview of voice conversion and its challenges： from statistical modeling to deep learning［J］. IEEE/ACM Transactions on Audio， Speech， and Language Processing， 2021， 29： 132-157.
4	HU C， ZHOU R， YUAN Q. Replay speech detection based on dual-input hierarchical fusion network［J］. Applied Sciences， 2023， 13（9）： No.5350.
5	任延珍，刘晨雨，刘武洋，等. 语音伪造及检测技术研究综述［J］. 信号处理， 2021， 37（12）： 2412-2439.
	REN Y Z， LIU C Y， LIU W Y， et al. A survey on speech forgery and detection［J］. Journal of Signal Processing， 2021， 37（12）： 2412-2439.
6	WEI L， LONG Y， WEI H， et al. New acoustic features for synthetic and replay spoofing attack detection［J］. Symmetry， 2022， 14（2）： No.274.
7	TODISCO M， DELGADO H， EVANS N. Constant Q cepstral coefficients： a spoofing countermeasure for automatic speaker verification［J］. Computer Speech and Language， 2017， 45： 516-535.
8	PATEL T B， PATIL H A. Combining evidences from mel cepstral， cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech［C］// Proceedings of the INTERSPEECH 2015. ［S.l.］： International Speech Communication Association， 2015： 2062-2066.
9	CUI S， HUANG B， HUANG J， et al. Synthetic speech detection based on local autoregression and variance statistics［J］. IEEE Signal Processing Letters， 2022， 29： 1462-1466.
10	RAVANELLI M， BENGIO Y. Speaker recognition from raw waveform with SincNet［C］// Proceedings of the 2018 IEEE Spoken Language Technology Workshop. Piscataway： IEEE， 2018： 1021-1028.
11	DINKEL H， CHEN N， QIAN Y， et al. End-to-end spoofing detection with raw waveform CLDNNS［C］// Proceedings of the 2017 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2017： 4860-4864.
12	MA Y， REN Z， XU S. RW-ResNet： a novel speech anti-spoofing model using raw waveform［C］// Proceedings of the INTERSPEECH 2021. ［S.l.］： International Speech Communication Association， 2021： 4144-4148.
13	TAK H， PATINO J， TODISCO M， et al. End-to-end anti-spoofing with RawNet2［C］// Proceedings of the 2021 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2021： 6369-6373.
14	JUNG J W， HEO H S， TAK H， et al. AASIST： audio anti-spoofing using integrated spectro-temporal graph attention networks［C］// Proceedings of the 2022 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2022： 6367-6371.
15	ALZANTOT M， WANG Z， SRIVASTAVA M B. Deep residual neural networks for audio spoofing detection［C］// Proceedings of the INTERSPEECH 2019. ［S.l.］： International Speech Communication Association， 2019： 1078-1082.
16	WU Z， DAS R K， YANG J， et al. Light convolutional neural network with feature genuinization for detection of synthetic speech attacks［C］// Proceedings of the INTERSPEECH 2020. ［S.l.］： International Speech Communication Association， 2020： 1101-1105.
17	LUO A， LI E， LIU Y， et al. A capsule network based approach for detection of audio spoofing attacks［C］// Proceedings of the 2021 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2021： 6359-6363.
18	HUA G， TEOH A B J， ZHANG H. Towards end-to-end synthetic speech detection［J］. IEEE Signal Processing Letters， 2021， 28： 1265-1269.
19	SZEGEDY C， LIU W， JIA Y， et al. Going deeper with convolutions［C］// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2015： 1-9.
20	LIU X， LIU M， WANG L， et al. Leveraging positional-related local-global dependency for synthetic speech detection［C］// Proceedings of the 2023 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2023： 1-5.
21	FUKUI H， HIRAKAWA T， YAMASHITA T， et al. Attention branch network： learning of attention mechanism for visual explanation［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 10697-10706.
22	张秋余，王煜坤. 基于改进Inception网络的语音分类模型［J］. 计算机应用， 2023， 43（3）： 909-915.
	ZHANG Q Y， WANG Y K. Speech classification model based on improved Inception network［J］. Journal of Computer Applications， 2023， 43（3）：909-915.
23	HU J， SHEN L， SUN G. Squeeze-and-excitation networks［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 7132-7141.
24	WANG X， YAMAGISHI J， TODISCO M， et al. ASVspoof 2019： a large-scale public database of synthesized， converted and replayed speech［J］. Computer Speech and Language， 2020， 64： No.101114 .
25	WU Z， KINNUNEN T， EVANS N， et al. ASVspoof 2015： the first automatic speaker verification spoofing and countermeasures challenge［C］// Proceedings of the INTERSPEECH 2015. ［S.l.］： International Speech Communication Association， 2015： 2037-2041.
26	LIU X， WANG X， SAHIDULLAH M， et al. ASVspoof 2021： towards spoofed and deepfake speech detection in the wild［J］. IEEE/ACM Transactions on Audio， Speech， and Language Processing， 2023， 31： 2507-2522.
27	BRÜMMER N， DE VILLIERS E. The BOSARIS Toolkit user guide： theory， algorithms and code for binary classifier score processing［EB/OL］. （2013-04-10）［2023-08-10］. .
28	KINNUNEN T， DELGADO H， EVANS N， et al. Tandem assessment of spoofing countermeasures and automatic speaker verification： fundamentals［J］. IEEE/ACM Transactions on Audio， Speech， and Language Processing， 2020， 28： 2195-2210.
29	WANG X， YAMAGISHI J. A comparative study on recent neural spoofing countermeasures for synthetic speech detection［C］// Proceedings of the INTERSPEECH 2021. ［S.l.］： International Speech Communication Association， 2021： 4259-4263.
30	ROSTAMI A M， HOMAYOUNPOUR M M， NICKABADI A. Efficient attention branch network with combined loss function for automatic speaker verification spoof detection［J］. Circuits， Systems， and Signal Processing， 2023， 42（7）： 4252-4270.
31	GE W， PATINO J， TODISCO M， et al. Raw differentiable architecture Search for speech deepfake and spoofing detection［C］// Proceedings of the 2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge. ［S.l.］： International Speech Communication Association， 2021： 22-28.
32	GONG J， CHEN N. Synthetic voice spoofing detection based on feature pyramid conformer［C］// Proceedings of the INTERSPEECH 2023. ［S.l.］： International Speech Communication Association， 2023： 2803-2807.
33	WANG X， YAMAGISHI J. Investigating self-supervised front ends for speech spoofing countermeasures［C］// Proceedings of the 2021 Speaker and Language Recognition Workshop. ［S.l.］： International Speech Communication Association， 2022： 100-106.

子集	真实语音样本数	合成语音样本数	伪造种类
训练集	2 580	22 800	A01~A06
开发集	2 548	22 296	A01~A06
测试集	7 355	63 882	A07~A19

子集	真实语音样本数	合成语音样本数	伪造种类
训练集	2 580	22 800	A01~A06
开发集	2 548	22 296	A01~A06
测试集	7 355	63 882	A07~A19

模型	EER
Inc-TSSDNet^［18］	4.04
Sinc-Attention	3.89
Sinc-Inception	3.49

模型	EER
Inc-TSSDNet^［18］	4.04
Sinc-Attention	3.89
Sinc-Inception	3.49

测试模型	EER
AB分支	1.81
IB分支	1.63
加权融合	1.15

基于注意力和挤压‒激励Inception的双分支合成语音检测

Dual branch synthetic speech detection based on attention and squeeze-excitation inception

RichHTML

PDF

可视化

摘要/Abstract

引用本文

使用本文

图/表 9

参考文献 33

相关文章 15

编辑推荐

Metrics

方法		EER/%	min t-DCF	计算效率/ms
Att	SE	EER/%	min t-DCF	计算效率/ms
×	×	3.49	0.086 9	5.67
×	√	2.13	0.066 2	7.17
√	×	1.33	0.037 2	6.28
√	√	1.15	0.033 2	9.97

模型	前端特征	测试集		参数量/10³
模型	前端特征	EER/%	min t-DCF	参数量/10³
RawNet2^［13］	Waveform	4.66	0.129 4	25 430
Inc-TSSDNet^［18］	Waveform	4.04	0.097 6	92
CapsNet^［17］	LFCC	1.97	0.053 8	—
LCNN-LSTM^［29］	LFCC	1.92	0.052 4	—
SE-ResABNet^［30］	LFCC	1.89	0.050 7	964
Raw PC-DARTS^［31］	Waveform	1.77	0.051 7	24 480
FPM+EM-Softmax^［32］	LFCC	1.65	0.047 0	—
LGF^［33］	Wav2Vec2.0	1.28	0.100 0	—
Dual-ABIB	Waveform	1.15	0.033 2	539

模型	EER
模型	19LA	15LA	21LA
LFCC-GMM^［24］	9.57	14.87	19.30
CQCC-GMM^［24］	8.09	36.31	15.62
Inc-TSSDNet^［18］	4.04	3.29	17.56
AASIST^［14］	1.13	3.22	10.90
Dual-ABIB	1.15	2.29	10.43

[1]	秦璟, 秦志光, 李发礼, 彭悦恒. 基于概率稀疏自注意力神经网络的重性抑郁疾患诊断[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2970-2974.
[2]	李力铤, 华蓓, 贺若舟, 徐况. 基于解耦注意力机制的多变量时序预测模型[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2732-2738.
[3]	赵志强, 马培红, 黑新宏. 基于双重注意力机制的人群计数方法[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2886-2892.
[4]	薛凯鹏, 徐涛, 廖春节. 融合自监督和多层交叉注意力的多模态情感分析网络[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2387-2392.
[5]	汪雨晴, 朱广丽, 段文杰, 李书羽, 周若彤. 基于交互注意力机制的心理咨询文本情感分类模型[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2393-2399.
[6]	高鹏淇, 黄鹤鸣, 樊永红. 融合坐标与多头注意力机制的交互语音情感识别[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2400-2406.
[7]	李钟华, 白云起, 王雪津, 黄雷雷, 林初俊, 廖诗宇. 基于图像增强的低照度人脸检测[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2588-2594.
[8]	莫尚斌, 王文君, 董凌, 高盛祥, 余正涛. 基于多路信息聚合协同解码的单通道语音增强[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2611-2617.
[9]	刘丽, 侯海金, 王安红, 张涛. 基于多尺度注意力的生成式信息隐藏算法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2102-2109.
[10]	徐松, 张文博, 王一帆. 基于时空信息的轻量视频显著性目标检测网络[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2192-2199.
[11]	李大海, 王忠华, 王振东. 结合空间域和频域信息的双分支低光照图像增强网络[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2175-2182.
[12]	魏文亮, 王阳萍, 岳彪, 王安政, 张哲. 基于光照权重分配和注意力的红外与可见光图像融合深度学习模型[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2183-2191.
[13]	熊武, 曹从军, 宋雪芳, 邵云龙, 王旭升. 基于多尺度混合域注意力机制的笔迹鉴别方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2225-2232.
[14]	李欢欢, 黄添强, 丁雪梅, 罗海峰, 黄丽清. 基于多尺度时空图卷积网络的交通出行需求预测[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2065-2072.
[15]	毛典辉, 李学博, 刘峻岭, 张登辉, 颜文婧. 基于并行异构图和序列注意力机制的中文实体关系抽取模型[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2018-2025.