Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (12): 3899-3906.DOI: 10.11772/j.issn.1001-9081.2023121857

• Multimedia computing and computer simulation •

Speaker verification method based on speech quality adaptation and triplet-like idea

Chao WANG, Shanshan YAO

  1. Institute of Big Data Science and Industry, Shanxi University, Taiyuan Shanxi 030006, China
  • Received:2024-01-05 Revised:2024-03-12 Accepted:2024-03-15 Online:2024-03-28 Published:2024-12-10
  • Contact: Shanshan YAO
  • About author:WANG Chao, born in 1995, M. S. candidate. His research interests include voiceprint recognition.
  • Supported by:
    National Natural Science Foundation of China(61906115);Fundamental Research Program of Shanxi Province(202303021221075)

  • About author:YAO Shanshan, born in 1989, Ph. D., associate professor, CCF member. Her research interests include voiceprint recognition and multimedia big data retrieval. E-mail: yaoshanshan@sxu.edu.cn

Abstract:

Aiming at the problem that current Speaker Verification (SV) methods suffer serious performance degradation in complex test scenarios or when speech quality degrades substantially, a speaker verification Method based on speech Quality Adaptation and Triplet-like idea (QATM) was proposed. Firstly, the feature norm of a speaker's utterance was used as a proxy for its speech quality. Secondly, by judging the quality of speech samples, different loss functions were selected to adjust the importance of samples of different qualities, so that hard samples with high speech quality were emphasized while hard samples with low speech quality were ignored. Finally, the triplet-like idea was used to improve both the AM-Softmax (Additive Margin Softmax) loss and the AAM-Softmax (Additive Angular Margin Softmax) loss, paying more attention to hard speaker samples so as to mitigate the damage that hard samples of very poor speech quality cause to the model. Experimental results show that, with the VoxCeleb2 development set as the training set, the proposed method reduces the Equal Error Rate (EER) on the VoxCeleb1-O test set by 6.41%, 3.89%, and 7.27% compared with the AAM-Softmax loss-based method under the Half-ResNet34, ResNet34, and ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network) architectures, respectively. With Cn-Celeb.Train as the training set, the proposed method reduces the EER on the Cn-Celeb.Eval evaluation set by 5.25% compared with the AAM-Softmax loss-based method under the Half-ResNet34 architecture. Therefore, the accuracy of the proposed method is improved in both ordinary and complex scenarios.
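The core idea of the abstract, linking an utterance's embedding norm to its speech quality and shrinking the angular margin for low-quality samples, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name, the norm threshold, and the margin-scaling rule are all hypothetical, and only the plain AAM-Softmax part follows the standard formulation.

```python
import numpy as np

def quality_adaptive_aam_loss(embeddings, weights, labels, s=30.0, m=0.2,
                              norm_threshold=10.0, low_quality_scale=0.5):
    """Sketch of an AAM-Softmax loss with a quality-adaptive margin.

    The feature norm of each embedding is treated as a proxy for speech
    quality (as the abstract suggests); low-norm (assumed low-quality)
    samples receive a reduced angular margin so that hard low-quality
    samples contribute less. Threshold and scale values are illustrative.
    """
    norms = np.linalg.norm(embeddings, axis=1)           # quality proxy
    x = embeddings / norms[:, None]                      # L2-normalised embeddings
    w = weights / np.linalg.norm(weights, axis=0)        # L2-normalised class weights
    cos = x @ w                                          # cosine similarity to each class
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    # quality-adaptive margin: full margin m only for high-norm samples
    margins = np.where(norms >= norm_threshold, m, m * low_quality_scale)
    rows = np.arange(len(labels))
    logits = s * cos
    logits[rows, labels] = s * np.cos(theta[rows, labels] + margins)
    # standard cross-entropy over the margin-adjusted, scaled logits
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()
```

Setting `m=0` recovers a plain normalised-softmax loss, which makes it easy to check that the margin strictly increases the loss for the target class; the paper's triplet-like modification and the switch between AM- and AAM-style margins would replace the fixed `np.where` rule above.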

Key words: Speaker Verification (SV), hard sample, speech quality, adaptation, triplet-like idea

