Journal of Computer Applications (official website) ›› 2025, Vol. 45 ›› Issue (1): 308-317. DOI: 10.11772/j.issn.1001-9081.2023121877

• Multimedia computing and computer simulation •

Anti-spoofing speaker verification system using global-local feature dependency

ZHANG Jialin1, REN Qinghua1, MAO Qirong1,2

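  1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China
  2. Jiangsu Province Big Data Ubiquitous Perception and Intelligent Agriculture Application Engineering Research Center (Jiangsu University), Zhenjiang, Jiangsu 212013, China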
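  • Received: 2024-01-06 Revised: 2024-02-27 Accepted: 2024-03-04 Online: 2024-04-01 Published: 2025-01-10
  • Corresponding author: MAO Qirong
  • About the authors: ZHANG Jialin (born 1998), male, from Laizhou, Shandong, M.S. candidate; main research interest: synthetic speech detection;
    REN Qinghua (born 1992), male, from Huai'an, Jiangsu, lecturer, Ph.D., CCF member; main research interests: image segmentation, transfer learning;
  • Funding:
    General Program of the National Natural Science Foundation of China (62176106); Special Scientific Research Project of the School of Emergency Management, Jiangsu University (KY-A-01)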

Speaker verification system utilizing global-local feature dependency for anti-spoofing

Jialin ZHANG1, Qinghua REN1, Qirong MAO1,2

  1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China
  2. Jiangsu Province Big Data Ubiquitous Perception and Intelligent Agriculture Application Engineering Research Center (Jiangsu University), Zhenjiang, Jiangsu 212013, China
  • Received:2024-01-06 Revised:2024-02-27 Accepted:2024-03-04 Online:2024-04-01 Published:2025-01-10
  • Contact: Qirong MAO
  • About author:ZHANG Jialin, born in 1998, M. S. candidate. His research interests include synthetic speech detection.
    REN Qinghua, born in 1992, Ph. D., lecturer. His research interests include image segmentation, transfer learning.
  • Supported by:
    General Program of the National Natural Science Foundation of China (62176106); Special Scientific Research Project of the School of Emergency Management, Jiangsu University (KY-A-01)

Abstract:

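Existing anti-spoofing speaker verification systems are built mainly on convolutional models and capture global feature dependencies poorly. To address this, an anti-spoofing speaker verification system that exploits global-local feature dependency is proposed. First, for the spoofed speech detection module, two filter combination schemes are designed to filter the raw speech, and samples are augmented by masking frequency sub-bands. Second, a multi-dimensional global attention mechanism is proposed: the channel, frequency, and time dimensions are pooled separately to obtain the global dependencies of each dimension, and the global information is fused with the original features by weighting. Finally, a statistical pyramid pooling time delay neural network (SPD-TDNN) is introduced in the speaker verification part; it computes the standard deviation of the features while extracting multi-scale time-frequency features, and adds global information. Experimental results show that, on the ASVspoof2019 dataset, the proposed spoofed speech detection system reduces the equal error rate (EER) by 65.4% compared with the integrated spectro-temporal graph convolution model (AASIST), and the proposed anti-spoofing speaker verification system reduces the spoofing-aware speaker verification EER by about 97.8% compared with a standalone pyramid pooling speaker verification system. These results confirm that both proposed modules achieve better classification by drawing on global feature dependency.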
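The frequency sub-band masking step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says the raw speech is filtered with two filter combinations, whereas this sketch makes the simplifying assumption that masking is applied directly to a time-by-frequency magnitude spectrogram stored as nested lists. The function names `mask_frequency_subband` and `augment` and the parameter `num_bands` are hypothetical.

```python
import random


def mask_frequency_subband(spec, num_bands, band_index=None, rng=None):
    """Return a copy of `spec` with one frequency sub-band set to zero.

    spec       : list of frames, each a list of per-bin magnitudes (time x frequency)
    num_bands  : number of equal-width sub-bands the frequency axis is split into
    band_index : sub-band to mask; picked at random when None
    Assumption: zeroing spectrogram bins stands in for the filtering of raw
    speech described in the abstract.
    """
    rng = rng or random.Random()
    num_bins = len(spec[0])
    if band_index is None:
        band_index = rng.randrange(num_bands)
    # Integer bin boundaries of the chosen sub-band.
    lo = band_index * num_bins // num_bands
    hi = (band_index + 1) * num_bins // num_bands
    return [
        [0.0 if lo <= k < hi else v for k, v in enumerate(frame)]
        for frame in spec
    ]


def augment(spec, num_bands):
    """Return the original sample followed by one masked copy per sub-band."""
    masked = [mask_frequency_subband(spec, num_bands, b) for b in range(num_bands)]
    return [spec] + masked
```

With `num_bands = 4`, each utterance yields five training samples under this assumed scheme: the original and four copies that each lack one quarter of the spectrum.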

Keywords: speaker verification, data augmentation, frequency masking, attention mechanism, spoofed speech detection

Abstract:

Existing anti-spoofing speaker verification systems are built mainly on convolutional models and cannot capture global feature dependency well. To address this problem, a speaker verification system utilizing global-local feature dependency for anti-spoofing was proposed. Firstly, for the speech spoofing detection module, two filter combinations were designed to filter the original speech, and sample augmentation was achieved by masking frequency sub-bands. Secondly, a multi-dimensional global attention mechanism was proposed, in which the global dependencies of each dimension were obtained by pooling the channel, frequency, and time dimensions separately, and the global information was fused with the original features by weighting. Finally, for the speaker verification part, a Statistical Pyramid Dense Time Delay Neural Network (SPD-TDNN) was introduced to compute the standard deviation of the features and add the global information while obtaining multi-scale time-frequency features. Experimental results show that, on the ASVspoof2019 dataset, the proposed speech spoofing detection system reduces the Equal Error Rate (EER) by 65.4% compared with the Audio Anti-Spoofing using Integrated Spectro-Temporal graph attention network (AASIST) model, and the proposed speaker verification system for anti-spoofing reduces the spoofing-aware speaker verification EER by 97.8% compared with the separate pyramid pooling speaker verification system. These results verify that the two proposed modules achieve better classification with the help of global feature dependency.
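The multi-dimensional global attention idea in the abstract (pool each of the channel, frequency, and time dimensions, then fuse the global information with the original features by weighting) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' network: it uses plain average pooling, a parameter-free sigmoid gate in place of any learned layers, and an assumed fusion rule (original feature plus the feature scaled by the mean of the three gates). The names `sigmoid` and `global_attention` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def global_attention(x):
    """Re-weight a feature tensor x[c][f][t] using per-dimension global context.

    Assumptions (not taken from the paper): average pooling, no learned
    parameters, and fusion as x * (1 + mean of the three gates).
    """
    C, F, T = len(x), len(x[0]), len(x[0][0])

    # Global average pooling: for each index along one dimension,
    # average over the other two dimensions.
    ch = [sum(x[c][f][t] for f in range(F) for t in range(T)) / (F * T) for c in range(C)]
    fr = [sum(x[c][f][t] for c in range(C) for t in range(T)) / (C * T) for f in range(F)]
    tm = [sum(x[c][f][t] for c in range(C) for f in range(F)) / (C * F) for t in range(T)]

    # Turn each pooled descriptor into a gate in (0, 1).
    wc = [sigmoid(v) for v in ch]
    wf = [sigmoid(v) for v in fr]
    wt = [sigmoid(v) for v in tm]

    # Weighted fusion of the global information with the original features.
    return [
        [
            [x[c][f][t] * (1.0 + (wc[c] + wf[f] + wt[t]) / 3.0) for t in range(T)]
            for f in range(F)
        ]
        for c in range(C)
    ]
```

The output has the same shape as the input, so a block like this could sit between convolutional layers; in a trained model the gates would normally come from learned projections of the pooled descriptors rather than a bare sigmoid.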

Key words: speaker verification, data augmentation, frequency masking, attention mechanism, speech spoofing detection

CLC Number: