《计算机应用》 (Journal of Computer Applications) official website ›› 2023, Vol. 43 ›› Issue (12): 3727-3732. DOI: 10.11772/j.issn.1001-9081.2022121902
Special topic: Artificial Intelligence
Yulian ZHANG, Shanshan YAO (corresponding author), Chao WANG, Jiang CHANG
Received: 2022-12-29
Revised: 2023-03-07
Accepted: 2023-03-08
Online: 2023-03-17
Published: 2023-12-10
Contact: Shanshan YAO
About the author: ZHANG Yulian, born in 1997 in Jincheng, Shanxi, is an M.S. candidate. Her research interests include voiceprint recognition.
Abstract:
The speaker verification task aims to determine whether an enrollment utterance and a test utterance belong to the same speaker. The voiceprint features extracted by a speaker recognition system are often disturbed by factors unrelated to identity, which severely degrades system accuracy. To address this problem, a Text-Independent Speaker Verification (TISV) method based on Uncertainty Learning (UL) was proposed. First, uncertainty was introduced into the speaker backbone network to learn the voiceprint feature (mean) and the uncertainty of the utterance data (variance) simultaneously, so as to model the uncertainty in the speech dataset. Then, a distributional representation of the feature was obtained through the resampling (reparameterization) trick. Finally, a KL-divergence regularization term was introduced into the speaker classification loss to constrain the noise distribution, which resolves the degeneration problem that arises when computing the classification loss. Experimental results show that, compared with the deterministic Thin ResNet34 model, the proposed model reduces the Equal Error Rate (EER) on the VoxCeleb1-O test set by 9.9% and 10.4%, and the minimum Detection Cost Function (minDCF) by 10.9% and 4.5%, when trained on the VoxCeleb1 and VoxCeleb2 development sets respectively. Thus, the proposed method improves accuracy in noisy, unconstrained scenarios.
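The three steps sketched in the abstract (learning a mean and a variance per utterance, resampling via the reparameterization trick, and a KL regularizer on the noise distribution) can be illustrated in a few lines. This is a minimal sketch of the generic technique, not the paper's implementation; the function names and a standard-normal prior are assumptions.

```python
import math
import random

def reparameterize(mu, logvar, rng=None):
    """Resampling trick: z = mu + sigma * eps, with eps ~ N(0, 1).

    mu and logvar are the two 256-d heads' outputs for one utterance;
    sampling through them keeps the operation differentiable in mu/logvar.
    """
    rng = rng or random.Random(0)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian embedding.

    Added to the classification loss with weight lambda, this keeps the
    predicted variance from collapsing (the degeneration problem).
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))
```

When mu = 0 and logvar = 0 (i.e., sigma = 1), the KL term is exactly zero, which is the distribution the regularizer pulls the noise toward.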
Yulian ZHANG, Shanshan YAO, Chao WANG, Jiang CHANG. Text-independent speaker verification method based on uncertainty learning[J]. Journal of Computer Applications, 2023, 43(12): 3727-3732.
| Method | Layer | Kernel size | Stride | Output size |
| --- | --- | --- | --- | --- |
| Baseline Thin ResNet34 | Conv1 | 3×3×32 | 1×1 | 32×64×L |
| | Res1 | 3×3×32 | 1×1 | 32×64×L |
| | Res2 | 3×3×64 | 2×2 | 64×32×L/2 |
| | Res3 | 3×3×128 | 2×2 | 128×16×L/4 |
| | Res4 | 3×3×256 | 2×2 | 256×8×L/8 |
| | Flatten | — | — | 2 048×L/8 |
| | SAP | — | — | 4 096 |
| | Linear | 256 | — | 256 |
| Proposed method | Conv1 | 3×3×32 | 1×1 | 32×64×L |
| | Res1 | 3×3×32 | 1×1 | 32×64×L |
| | Res2 | 3×3×64 | 2×2 | 64×32×L/2 |
| | Res3 | 3×3×128 | 2×2 | 128×16×L/4 |
| | Res4 | 3×3×256 | 2×2 | 256×8×L/8 |
| | Flatten | — | — | 2 048×L/8 |
| | SAP | — | — | 4 096 |
| | mu | 256 | — | 256 |
| | logvar | 256 | — | 256 |

Tab. 1 Network architectures of the baseline Thin ResNet34 and the proposed method
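The output sizes in Table 1 can be reproduced with a small shape calculator. This is an illustrative sketch, assuming a 64-bin spectrogram input of L frames, that each 2×2-stride stage halves the frequency and time axes, and that the pooling layer concatenates two 2 048-dimensional statistics (which would explain the jump from 2 048 to 4 096); the function name is hypothetical.

```python
def thin_resnet34_shapes(n_frames):
    """Trace feature-map sizes (channels, freq, time) through the
    backbone of Table 1, for a 64-bin input with n_frames frames."""
    c, f, t = 1, 64, n_frames
    shapes = []
    for name, out_c, stride in [("Conv1", 32, 1), ("Res1", 32, 1),
                                ("Res2", 64, 2), ("Res3", 128, 2),
                                ("Res4", 256, 2)]:
        c, f, t = out_c, f // stride, t // stride
        shapes.append((name, (c, f, t)))
    shapes.append(("Flatten", (c * f, t)))   # 256*8 = 2 048 channels, L/8 frames
    shapes.append(("SAP", (2 * c * f,)))     # two pooled statistics -> 4 096
    shapes.append(("mu/logvar", (256, 256))) # two 256-d heads (proposed method)
    return shapes
```

For example, with L = 200 frames, Res4 produces a 256×8×25 map, matching the 256×8×L/8 row of the table.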
| Method | EER/% | minDCF |
| --- | --- | --- |
| Baseline Thin ResNet34 | 2.413 | 0.304 |
| Proposed method | 2.175 | 0.271 |

Tab. 2 Evaluation results on the VoxCeleb1-O test set after training on the VoxCeleb1 development set
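EER and minDCF, the metrics reported in Tables 2 and 3, are standard speaker-verification measures computed from trial scores. The following is a simple, unoptimized sketch; the prior and cost defaults are illustrative assumptions, not the paper's evaluation settings.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """EER: error rate at the threshold where the false-acceptance rate
    (FAR) and false-rejection rate (FRR) are closest to equal."""
    best_gap, eer = 2.0, 1.0
    for th in sorted(set(target_scores) | set(nontarget_scores)):
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < th for s in target_scores) / len(target_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

def min_dcf(target_scores, nontarget_scores,
            p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all score thresholds
    (prior p_target and costs here are illustrative defaults)."""
    best = float("inf")
    for th in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss = sum(s < th for s in target_scores) / len(target_scores)
        p_fa = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        best = min(best, c_miss * p_miss * p_target
                   + c_fa * p_fa * (1 - p_target))
    return best / min(c_miss * p_target, c_fa * (1 - p_target))
```

With perfectly separated scores both metrics are zero; in practice, the lower the value, the better the verification system.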
| Test set | Method | EER/% | minDCF |
| --- | --- | --- | --- |
| VoxCeleb1-O | Baseline Thin ResNet34 | 1.186 | 0.132 |
| | Proposed method | 1.063 | 0.126 |
| VoxCeleb1-E | Baseline Thin ResNet34 | 1.332 | 0.152 |
| | Proposed method | 1.261 | 0.144 |
| VoxCeleb1-H | Baseline Thin ResNet34 | 2.433 | 0.244 |
| | Proposed method | 2.361 | 0.237 |

Tab. 3 Evaluation results on the VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H test sets after training on the VoxCeleb2 development set
| λ | VoxCeleb1-O EER/% (trained on VoxCeleb1 dev) | minDCF | VoxCeleb1-O EER/% (trained on VoxCeleb2 dev) | minDCF | VoxCeleb1-E EER/% | minDCF | VoxCeleb1-H EER/% | minDCF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2.319 | 0.302 | 1.164 | 0.131 | 1.312 | 0.150 | 2.403 | 0.239 |
| 10⁻² | 2.217 | 0.294 | 1.138 | 0.130 | 1.251 | 0.142 | 2.351 | 0.233 |
| 10⁻³ | 2.175 | 0.271 | 1.063 | 0.126 | 1.261 | 0.144 | 2.361 | 0.237 |
| 10⁻⁴ | 2.185 | 0.261 | 1.106 | 0.129 | 1.262 | 0.148 | 2.345 | 0.232 |

Tab. 4 Evaluation results of the proposed method on different training and test sets with different hyperparameter λ values
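The hyperparameter λ swept in Table 4 weights the KL regularizer against the speaker classification loss, consistent with the objective described in the abstract. A minimal sketch of that combination, with illustrative loss values and a hypothetical function name:

```python
def total_objective(cls_loss, kl_loss, lam=1e-3):
    """Training objective sketch: speaker classification loss plus a
    lambda-weighted KL-divergence regularizer on the noise distribution.
    lam = 0 disables the regularizer (first row of Table 4)."""
    return cls_loss + lam * kl_loss

# Sweep over the lambda values examined in Table 4
# (cls_loss = 2.0 and kl_loss = 10.0 are illustrative numbers only):
sweep = {lam: total_objective(2.0, 10.0, lam)
         for lam in (0.0, 1e-2, 1e-3, 1e-4)}
```

A larger λ constrains the noise distribution more strongly at the expense of the classification term; in Table 4, λ = 10⁻³ gives the best VoxCeleb1-O EER for both training sets.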
| Constraint | EER/% | minDCF |
| --- | --- | --- |
| Noise approximated as Gaussian over the whole dataset | 1.063 | 0.126 |
| Noise approximated as Gaussian within each batch | 1.130 | 0.129 |

Tab. 5 Ablation experiment results (VoxCeleb1-O test set)