Journal of Computer Applications (《计算机应用》) ›› 2023, Vol. 43 ›› Issue (1): 227-231. DOI: 10.11772/j.issn.1001-9081.2021101845

Special topic: Multimedia Computing and Computer Simulation

基于时域波形的半监督端到端虚假语音检测方法

方昕1,2, 黄泽鑫3, 张聿晗2, 高天2, 潘嘉2, 付中华4, 高建清2, 刘俊华2, 邹亮3   

    1.语音及语言信息处理国家工程实验室(中国科学技术大学),合肥 230027
    2.科大讯飞股份有限公司 AI研究院,合肥 230088
    3.中国矿业大学 信息与控制工程学院,江苏 徐州 221116
    4.西安讯飞超脑信息科技有限公司,西安 710000
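  • Received: 2021-11-01 Revised: 2022-01-13 Published: 2022-03-04
  • Corresponding author: GAO Tian (高天, born 1991), male, from Fuyang, Anhui; engineer, Ph. D. Research interests: speaker recognition, speech signal processing. E-mail: tiangao5@iflytek.com
  • About the authors: FANG Xin (方昕, born 1988), male, from Chizhou, Anhui; engineer, Ph. D. candidate. Research interests: speech recognition, speaker recognition; HUANG Zexin (黄泽鑫, born 1995), male, from Quanzhou, Fujian; M. S. candidate. Research interests: deep learning, speech signal processing; ZHANG Yuhan (张聿晗, born 1995), male, from Hefei, Anhui; M. S. Research interests: speaker recognition; PAN Jia (潘嘉, born 1985), male, from Hefei, Anhui; Ph. D. Research interests: deep learning, speech signal processing; FU Zhonghua (付中华, born 1977), male, from Xi'an, Shaanxi; associate professor, Ph. D. Research interests: speech signal processing; GAO Jianqing (高建清, born 1983), male, from Hefei, Anhui; senior engineer, Ph. D. Research interests: deep learning, speech signal processing; LIU Junhua (刘俊华, born 1985), male, from Fuyang, Anhui; professor-level senior engineer, Ph. D. Research interests: deep learning, speech signal processing; ZOU Liang (邹亮, born 1987), male, from Fuyang, Anhui; associate professor, Ph. D. Research interests: deep learning, signal processing.
  • Funding:
    Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0103600).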

Semi‑supervised end‑to‑end fake speech detection method based on time‑domain waveforms

FANG Xin1,2, HUANG Zexin3, ZHANG Yuhan2, GAO Tian2, PAN Jia2, FU Zhonghua4, GAO Jianqing2, LIU Junhua2, ZOU Liang3   

    1.National Engineering Laboratory for Speech and Language Information Processing (University of Science and Technology of China), Hefei Anhui 230027, China
    2.AI Institute, iFLYTEK Company Limited, Hefei Anhui 230088, China
    3.School of Information and Control Engineering, China University of Mining and Technology, Xuzhou Jiangsu 221116, China
    4.Xi'an iFLYTEK Hyper-brain Information Technology Company Limited, Xi'an Shaanxi 710000, China
  • Received:2021-11-01 Revised:2022-01-13 Online:2022-03-04
  • Contact: GAO Tian, born in 1991, Ph. D., engineer. His research interests include speaker recognition, speech signal processing.
  • About author:FANG Xin, born in 1988, Ph. D. candidate, engineer. His research interests include speech recognition, speaker recognition;HUANG Zexin, born in 1995, M. S. candidate. His research interests include deep learning, speech signal processing;ZHANG Yuhan, born in 1995, M. S. His research interests include speaker recognition;PAN Jia, born in 1985, Ph. D. His research interests include deep learning, speech signal processing;FU Zhonghua, born in 1977, Ph. D., associate professor. His research interests include speech signal processing;GAO Jianqing, born in 1983, Ph. D., senior engineer. His research interests include deep learning, speech signal processing;LIU Junhua, born in 1985, Ph. D., professor of engineering. His research interests include deep learning, speech signal processing;ZOU Liang, born in 1987, Ph. D., associate professor. His research interests include deep learning, signal processing;
  • Supported by:
    This work is partially supported by Science and Technology Innovation 2030 — "New Generation Artificial Intelligence" Major Project (2020AAA0103600).

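Abstract: Fake speech generated by modern speech synthesis and timbre conversion systems poses a serious threat to automatic speaker recognition systems. Most existing fake speech detection systems perform well on attack types seen during training, but their detection performance drops significantly on unknown attack types encountered in practical applications. Therefore, building on the recently proposed dual-path Res2Net (DP-Res2Net), a semi-supervised end-to-end fake speech detection method based on time-domain waveforms is proposed. First, to address the large difference in data distribution between the training and test datasets, semi-supervised learning is used for domain transfer. Then, for feature engineering, the time-domain sampling points are fed directly into DP-Res2Net, which adds local multi-scale information and makes full use of the dependencies between audio segments. Finally, the input features pass through a shallow convolution module, a feature fusion module and a global average pooling module to obtain an embedding tensor, which is used to discriminate natural speech from fake speech. The performance of the proposed method was evaluated on the publicly available ASVspoof 2021 Speech Deep Fake evaluation set and the VCC dataset. Experimental results show that its Equal Error Rate (EER) is 19.97%, which is 10.8% lower than that of the best official baseline system. The semi-supervised end-to-end fake speech detection method based on time-domain waveforms is effective against unknown attacks and has stronger generalization ability.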

Keywords: fake speech detection, speech synthesis, timbre conversion, speaker recognition, time domain, semi-supervised learning

Abstract: The fake speech produced by modern speech synthesis and timbre conversion systems poses a serious threat to automatic speaker recognition systems. Most existing fake speech detection systems perform well on the attack types seen during training, but their performance degrades significantly on unknown attack types encountered in practical applications. Therefore, building on the recently proposed Dual-Path Res2Net (DP-Res2Net), a semi-supervised end-to-end fake speech detection method based on time-domain waveforms was proposed. Firstly, semi-supervised learning was adopted for domain transfer to address the large difference in data distribution between the training set and the test set. Then, for feature engineering, time-domain sampling points were input into DP-Res2Net directly, which increased the local multi-scale information and made full use of the dependence between audio segments. Finally, the input features were passed through the shallow convolution module, the feature fusion module and the global average pooling module to obtain an embedding tensor, which was used to distinguish natural speech from fake speech. The performance of the proposed method was evaluated on the publicly available ASVspoof 2021 Speech Deep Fake evaluation set as well as the Voice Conversion Challenge (VCC) dataset. Experimental results show that the Equal Error Rate (EER) of the proposed method is 19.97%, which is 10.8% lower than that of the best official baseline system, verifying that the semi-supervised end-to-end fake speech detection method based on time-domain waveforms is effective against unknown attacks and has higher generalization capability.

Key words: fake speech detection, speech synthesis, timbre conversion, speaker recognition, time domain, semi-supervised learning
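
The abstract reports results as an Equal Error Rate (EER), the operating point at which the false acceptance rate (fake speech accepted as natural) equals the false rejection rate (natural speech rejected as fake). The following is a minimal sketch of how an EER could be estimated from detector scores. It is not the evaluation code used by the authors or by ASVspoof; the function name `compute_eer`, the convention that a higher score means "more likely natural", and the toy scores are all assumptions made for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the EER by sweeping thresholds over the observed scores.

    Assumption: a higher score means the utterance is more likely natural.
    Returns (eer, threshold) at the threshold where FAR and FRR are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection: natural utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: fake utterances scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the mean of FAR and FRR where the two are closest.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical, not taken from the paper).
natural = [0.9, 0.8, 0.7, 0.4]
fake = [0.6, 0.3, 0.2, 0.1]
eer, thr = compute_eer(natural, fake)
print(f"EER = {eer:.2%} at threshold {thr}")  # EER = 25.00% at threshold 0.6
```

On real evaluation sets with many thousands of trials, the threshold sweep is normally done on sorted scores with interpolation between adjacent operating points; the exhaustive loop here is kept only for readability.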
