Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (10): 3217-3222. DOI: 10.11772/j.issn.1001-9081.2023101458

• Multimedia computing and computer simulation •

Dual-branch synthetic speech detection based on attention and squeeze-excitation Inception

Han WANG, Lasheng ZHAO, Qiang ZHANG, Yinqing CHENG, Zepeng QIU

  1. Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education (Dalian University), Dalian Liaoning 116622, China
  • Received: 2023-10-27 Revised: 2024-02-22 Accepted: 2024-02-26 Online: 2024-10-15 Published: 2024-10-10
  • Contact: Lasheng ZHAO (goodzls@126.com)
  • About author: WANG Han, born in 1998 in Tieling, Liaoning, M. S. candidate. Her research interests include deep learning, spoof speech detection.
    ZHAO Lasheng, born in 1978 in Shuozhou, Shanxi, Ph. D., lecturer. His research interests include deep learning, speech signal processing.
    ZHANG Qiang, born in 1971 in Xi'an, Shaanxi, Ph. D., professor. His research interests include biocomputing and artificial intelligence, big data analysis and processing.
    CHENG Yinqing, born in 1999 in Xianning, Hubei, M. S. candidate. Her research interests include deep learning, spoof speech detection.
    QIU Zepeng, born in 1998 in Weifang, Shandong, M. S. candidate. His research interests include deep learning, speech keyword detection.
  • Supported by:
    Basic Scientific Research Project of Educational Department of Liaoning Province (LJKMZ20221838)

Abstract:

Synthetic speech attacks pose a significant threat to people’s lives. Existing models struggle to extract key information from redundant data, and a single model cannot combine the strengths of multiple detection models. To address both problems, a Dual-branch synthetic speech detection model with an Attention Branch and a Squeeze-Excitation Inception (SE-Inc) Branch (Dual-ABIB) was proposed. Firstly, the initial feature maps extracted by the Sinc-based Convolutional Neural Network (SincNet) were used to train the attention branch of the synthetic speech detection model, which output the attention maps. Secondly, the attention maps were multiplied with the initial feature maps and then superposed, and the result was used as the input for training the SE-Inc branch. Finally, the classification scores obtained by the two branches were combined through decision-level weighted fusion to perform synthetic speech detection. Experimental results show that, with 539×10³ parameters, the proposed model achieves a minimum tandem Detection Cost Function (min t-DCF) of 0.0332 and an Equal Error Rate (EER) of 1.15% on the ASVspoof2019 dataset. Compared with SE-ResABNet (Squeeze-Excitation ResNet Attention Branch Network), the proposed model has only 56% as many parameters while reducing min t-DCF and EER by 34.5% and 39.2%, respectively. The proposed model also shows better generalization ability on the ASVspoof2015 and ASVspoof2021 datasets. These results verify that Dual-ABIB obtains lower min t-DCF and EER with fewer parameters.
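
The two combination steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "multiplied and then superposed" means the product is added back to the initial feature map, and the function names, the fusion weight, and all numeric values are hypothetical.

```python
def apply_attention(features, attention):
    """Re-weight a 2-D feature map with an attention map of the same shape.

    Assumed form: out = features * attention + features, so a zero attention
    weight leaves the initial feature unchanged instead of erasing it.
    """
    return [
        [f * a + f for f, a in zip(f_row, a_row)]
        for f_row, a_row in zip(features, attention)
    ]


def fuse_scores(att_score, se_inc_score, weight=0.5):
    """Decision-level weighted fusion of the two branch scores.

    `weight` is a hypothetical hyperparameter in [0, 1], not a value
    reported in the paper.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * att_score + (1.0 - weight) * se_inc_score


# Toy values chosen only for illustration.
features = [[1.0, 2.0], [3.0, 4.0]]
attention = [[0.0, 0.5], [1.0, 0.25]]

# Input that the SE-Inc branch would receive under the assumption above.
se_inc_input = apply_attention(features, attention)

# Final score from the two branch scores, weighting the attention branch by 0.25.
fused = fuse_scores(att_score=0.8, se_inc_score=0.4, weight=0.25)
```

Because the fusion acts on scores rather than on features, each branch can be trained and evaluated on its own, and the weight can be tuned afterwards on held-out data.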

Key words: attention mechanism, Squeeze-Excitation (SE) module, dual branch, synthetic speech detection, decision-level fusion

CLC number: