Angular interval embedding based end-to-end voiceprint recognition model


Journal of Computer Applications, 2019, Vol. 39, Issue (10): 2937-2941. DOI: 10.11772/j.issn.1001-9081.2019040757

• Artificial intelligence •


WANG Kang¹, DONG Yuanfei¹,²

1. Nanjing Fiber Home World Communication Technology Company Limited, Nanjing Jiangsu 210019, China;
2. Wuhan Research Institute of Posts and Telecommunications, Wuhan Hubei 430074, China

Received: 2019-05-05 Revised: 2019-07-06 Online: 2019-08-21 Published: 2019-10-10
Supported by:
This work is partially supported by the National Key Research and Development Program of China (2017YFB1400704).


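Corresponding author: WANG Kang
About the authors: WANG Kang, born in 1987, male, native of Nanjing, Jiangsu, engineer. His research interests include video object tracking and behavior analysis, image recognition, audio recognition, and high-performance computing; DONG Yuanfei, born in 1995, female, native of Wuhan, Hubei, M.S. candidate. Her research interests include speech signal processing and deep learning.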

Abstract

To address the cumbersome multi-step pipeline and weak generalization ability of the traditional voiceprint recognition model that combines the identity vector (i-vector) with Probabilistic Linear Discriminant Analysis (PLDA), an end-to-end model based on angular interval embedding was constructed. A purpose-built deep convolutional neural network extracts deep speaker embeddings from the acoustic features of speech data. Angular Softmax (A-Softmax), an angle-based modification of the softmax loss, was used as the loss function so that the features learned for different classes always remain separated by an angular interval, while features of the same class cluster more tightly in angular space. In tests on the public dataset VoxCeleb2, compared with the method combining i-vector and PLDA, the proposed model improved the Top-1 and Top-5 accuracy of speaker identification by 58.9% and 30% respectively, and reduced the minimum detection cost and equal error rate of speaker verification by 47.9% and 45.3% respectively. The results indicate that the proposed end-to-end model is better suited to learning class-discriminative features from multi-channel, large-scale speech datasets.
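To illustrate the loss function named in the abstract, the following is a minimal sketch of A-Softmax for a single embedding, following the general definition in the SphereFace paper (reference [10]) rather than this article's own implementation. The margin value m = 4, the function names `psi` and `a_softmax_loss`, and the toy vectors in the usage note are assumptions for illustration only.

```python
import math


def psi(theta: float, m: int) -> float:
    """Monotonically decreasing extension of cos(m * theta) over [0, pi]."""
    # theta lies in the interval [k * pi / m, (k + 1) * pi / m]
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k


def a_softmax_loss(x, weights, label, m=4):
    """A-Softmax loss for one embedding x (assumed margin m, zero bias).

    Class weight vectors are normalized, so each logit depends only on the
    norm of x and the angle between x and the class weight vector.
    """
    x_norm = math.sqrt(sum(v * v for v in x))
    logits = []
    for j, w in enumerate(weights):
        w_norm = math.sqrt(sum(v * v for v in w))
        cos_t = sum(a * b for a, b in zip(x, w)) / (x_norm * w_norm)
        cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding error
        if j == label:
            # the target class must win by an angular margin
            logits.append(x_norm * psi(math.acos(cos_t), m))
        else:
            logits.append(x_norm * cos_t)
    # numerically stable negative log-softmax of the target logit
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]
```

With m = 1 the function reduces to a softmax loss with normalized weights, so calling `a_softmax_loss([1.0, 0.5], [[1.0, 0.0], [0.0, 1.0]], 0, m=4)` returns a larger value than the same call with `m=1`. This larger penalty is what pushes same-class embeddings closer together in angle during training.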

Key words: voiceprint recognition, end-to-end model, loss function, convolutional neural network, deep speaker embedding



WANG Kang, DONG Yuanfei. Angular interval embedding based end-to-end voiceprint recognition model[J]. Journal of Computer Applications, 2019, 39(10): 2937-2941.


