Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (10): 2937-2941. DOI: 10.11772/j.issn.1001-9081.2019040757

• Artificial intelligence

Angular interval embedding based end-to-end voiceprint recognition model

WANG Kang1, DONG Yuanfei1,2   

  1. Nanjing Fiber Home World Communication Technology Company Limited, Nanjing Jiangsu 210019, China;
    2. Wuhan Research Institute of Posts and Telecommunications, Wuhan Hubei 430074, China
  • Received: 2019-05-05 Revised: 2019-07-06 Online: 2019-08-21 Published: 2019-10-10
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2017YFB1400704).

基于角度间隔嵌入特征的端到端声纹识别模型

王康1, 董元菲1,2   

  1. 南京烽火天地通信科技有限公司, 南京 210019;
    2. 武汉邮电科学研究院, 武汉 430074
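  • Corresponding author: WANG Kang
  • About the authors: WANG Kang (born 1987), male, from Nanjing, Jiangsu, engineer; main research interests: video object tracking and behavior analysis, image recognition, audio recognition, high-performance computing. DONG Yuanfei (born 1995), female, from Wuhan, Hubei, master's student; main research interests: speech signal processing, deep learning.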
  • 基金资助:
    国家重点研发计划项目(2017YFB1400704)。

Abstract: To address the cumbersome multi-step pipeline and weak generalization ability of the traditional voiceprint recognition model that combines the identity vector (i-vector) with Probabilistic Linear Discriminant Analysis (PLDA), an end-to-end model based on angular interval embedding was constructed. A purpose-designed deep convolutional neural network extracts deep speaker embeddings from the acoustic features of speech data. Angular Softmax (A-Softmax), an angle-based modification of the softmax loss, was adopted as the loss function so that the features of different classes learned by the model always remain separated by an angular interval, while features of the same class cluster more tightly in angular space. Tests on the public dataset VoxCeleb2 show that, compared with the method combining i-vector and PLDA, the proposed model increases the Top-1 and Top-5 accuracy in speaker identification by 58.9% and 30% respectively, and reduces the minimum detection cost and equal error rate in speaker verification by 47.9% and 45.3% respectively. The results verify that the proposed end-to-end model is better suited to learning class-discriminative features from multi-channel, large-scale speech datasets.
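
As an illustration of the angular interval idea described in the abstract, the following is a minimal NumPy sketch of an A-Softmax-style loss in its commonly published form, not the authors' implementation. The function name `a_softmax_loss`, the margin value `m=4`, and the array shapes are assumptions made for this example; the abstract does not state the margin or network details used in the paper. The sketch normalizes the class weights, computes the angle between each embedding and its true-class weight, and replaces cos(θ) for the true class with the stricter margin function ψ(θ) = (-1)^k cos(mθ) - 2k.

```python
import numpy as np

def a_softmax_loss(x, W, labels, m=4):
    """Mean A-Softmax-style loss (illustrative sketch).

    x:      (N, D) speaker embeddings, assumed non-zero
    W:      (D, C) class weight matrix (one column per speaker)
    labels: (N,) integer class indices
    m:      integer angular margin (hypothetical default)
    """
    W_unit = W / np.linalg.norm(W, axis=0, keepdims=True)   # unit-norm class weights
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)        # (N, 1) embedding norms
    cos_theta = np.clip((x @ W_unit) / x_norm, -1.0, 1.0)    # (N, C) cosines
    logits = x_norm * cos_theta                              # logits without margin

    idx = np.arange(x.shape[0])
    theta_y = np.arccos(cos_theta[idx, labels])              # angle to the true class
    # k selects the interval [k*pi/m, (k+1)*pi/m] that contains theta_y
    k = np.minimum(np.floor(theta_y * m / np.pi), m - 1)
    psi = (-1.0) ** k * np.cos(m * theta_y) - 2.0 * k        # monotone margin function
    logits[idx, labels] = x_norm[:, 0] * psi                 # apply margin to true class only

    # Numerically stable log-softmax followed by cross-entropy
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[idx, labels].mean()
```

With `m=1` the margin function reduces to cos(θ), so the loss becomes an ordinary softmax loss with normalized weights. A larger `m` lowers the true-class logit, which pushes the model to leave an angular gap between classes during training.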

Key words: voiceprint recognition, end-to-end model, loss function, convolutional neural network, deep speaker embedding

摘要: 针对传统身份认证矢量(i-vector)与概率线性判别分析(PLDA)结合的声纹识别模型步骤繁琐、泛化能力较弱等问题,构建了一个基于角度间隔嵌入特征的端到端模型。该模型特别设计了一个深度卷积神经网络,从语音数据的声学特征中提取深度说话人嵌入;选择基于角度改进的A-Softmax作为损失函数,在角度空间中使模型学习到的不同类别特征始终存在角度间隔并且同类特征间聚集更紧密。在公开数据集VoxCeleb2上进行的测试表明,与i-vector结合PLDA的方法相比,该模型在说话人辨认中的Top-1和Top-5上准确率分别提高了58.9%和30%;而在说话人确认中的最小检测代价和等错误率上分别减小了47.9%和45.3%。实验结果验证了所设计的端到端模型更适合在多信道、大规模的语音数据集上学习到有类别区分性的特征。

关键词: 声纹识别, 端到端模型, 损失函数, 卷积神经网络, 深度说话人嵌入
