Semi-supervised end-to-end fake speech detection method based on time-domain waveforms

Journal of Computer Applications, 2023, Vol. 43, Issue (1): 227-231. DOI: 10.11772/j.issn.1001-9081.2021101845

Topic: Multimedia computing and computer simulation

FANG Xin¹,², HUANG Zexin³, ZHANG Yuhan², GAO Tian², PAN Jia², FU Zhonghua⁴, GAO Jianqing², LIU Junhua², ZOU Liang³

1. National Engineering Laboratory for Speech and Language Information Processing (University of Science and Technology of China), Hefei, Anhui 230027, China
2. AI Institute, iFLYTEK Company Limited, Hefei, Anhui 230088, China
3. School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China
4. Xi'an iFLYTEK Hyper-brain Information Technology Company Limited, Xi'an, Shaanxi 710000, China

Received: 2021-11-01 Revised: 2022-01-13 Online: 2022-03-04
Corresponding author: GAO Tian, born in 1991 in Fuyang, Anhui, Ph.D., engineer. His research interests include speaker recognition and speech signal processing. E-mail: tiangao5@iflytek.com
About the authors: FANG Xin, born in 1988 in Chizhou, Anhui, Ph.D. candidate, engineer. His research interests include speech recognition and speaker recognition; HUANG Zexin, born in 1995 in Quanzhou, Fujian, M.S. candidate. His research interests include deep learning and speech signal processing; ZHANG Yuhan, born in 1995 in Hefei, Anhui, M.S. His research interests include speaker recognition; PAN Jia, born in 1985 in Hefei, Anhui, Ph.D. His research interests include deep learning and speech signal processing; FU Zhonghua, born in 1977 in Xi'an, Shaanxi, Ph.D., associate professor. His research interests include speech signal processing; GAO Jianqing, born in 1983 in Hefei, Anhui, Ph.D., senior engineer. His research interests include deep learning and speech signal processing; LIU Junhua, born in 1985 in Fuyang, Anhui, Ph.D., professor of engineering. His research interests include deep learning and speech signal processing; ZOU Liang, born in 1987 in Fuyang, Anhui, Ph.D., associate professor. His research interests include deep learning and signal processing.
Supported by: This work is partially supported by the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0103600).

Abstract

Fake speech produced by modern speech synthesis and timbre conversion systems poses a serious threat to automatic speaker recognition systems. Most existing fake speech detection systems perform well on attack types seen during training, but their performance degrades significantly on unknown attack types encountered in practical applications. Therefore, building on the recently proposed Dual-Path Res2Net (DP-Res2Net), a semi-supervised end-to-end fake speech detection method based on time-domain waveforms was proposed. Firstly, semi-supervised learning was adopted for domain transfer to address the large difference in data distribution between the training set and the test set. Then, for feature engineering, time-domain sampling points were input directly into DP-Res2Net, which increased the local multi-scale information and made full use of the dependence between audio segments. Finally, the input features were passed through a shallow convolution module, a feature fusion module and a global average pooling module to obtain embedding tensors, which were used to distinguish natural speech from fake speech. The performance of the proposed method was evaluated on the publicly available ASVspoof 2021 Speech Deep Fake evaluation set and the Voice Conversion Challenge (VCC) dataset. Experimental results show that the Equal Error Rate (EER) of the proposed method is 19.97%, which is 10.8% lower than that of the official best baseline system, verifying that the semi-supervised end-to-end fake speech detection method based on time-domain waveforms is effective against unknown attacks and has higher generalization capability.

Key words: fake speech detection, speech synthesis, timbre conversion, speaker recognition, time domain, semi-supervised learning
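The abstract reports results as an Equal Error Rate (EER), the operating point at which the false acceptance rate (spoofed speech accepted as natural) equals the false rejection rate (natural speech rejected). The following is a minimal sketch of how an EER can be estimated from detector scores, not the authors' evaluation code. The function name `equal_error_rate`, the convention that a higher score means "more likely natural", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate the EER and its threshold from two lists of detector scores.

    Assumption: a higher score means the utterance is more likely natural
    (bona fide). A trial is accepted when its score is >= the threshold.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))

    best = None  # (gap between FAR and FRR, EER estimate, threshold)
    for t in thresholds:
        # False rejection: natural speech scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: fake speech scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # On finite data FAR and FRR rarely coincide exactly,
            # so the EER is approximated by their mean at the closest point.
            best = (gap, (far + frr) / 2.0, t)

    return best[1], best[2]


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    natural = [0.9, 0.8, 0.7, 0.4]
    fake = [0.6, 0.3, 0.2, 0.1]
    eer, threshold = equal_error_rate(natural, fake)
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

With the toy scores above, one natural utterance falls below the threshold and one fake utterance lies at or above it, so both error rates are 25%. Evaluation toolkits typically interpolate along the ROC curve instead of scanning observed scores, which gives a smoother estimate on large score sets.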

CLC number: TP391.5

FANG Xin, HUANG Zexin, ZHANG Yuhan, GAO Tian, PAN Jia, FU Zhonghua, GAO Jianqing, LIU Junhua, ZOU Liang. Semi-supervised end-to-end fake speech detection method based on time-domain waveforms[J]. Journal of Computer Applications, 2023, 43(1): 227-231.


