
Construction method of voiceprint library based on multi-scale frequency channel attention fusion (BigData2023+P00358)

Tong CHEN, Feng-Yu YANG, Yu XIONG, Hong YAN, Fu-Xing QIU

  1. Nanchang Hangkong University
  • Received: 2023-09-18 Revised: 2023-09-26 Online: 2023-12-18
  • Contact: Feng-Yu YANG

Abstract: To address the problem that voiceprint recognition accuracy is easily affected by external factors, a voiceprint recognition algorithm based on a multi-scale frequency channel attention fusion time-delay neural network (MFCA-TDNN) is proposed. The model makes three improvements to ECAPA-TDNN: a multi-scale frequency channel attention front end that obtains high-resolution feature representations from utterances; a multi-scale channel attention module that combines local and global features to fuse multi-scale information; and a feature attention fusion module that weights the fused multi-scale features. These improvements let the model make better use of multi-scale time-frequency information and improve its recognition ability. To verify that the improvements raise the accuracy and reliability of voiceprint recognition, experiments were run on a public telephone speech dataset: comparison experiments on data augmentation, recognition comparison experiments on feature extraction, and ablation experiments on the modified model components. The results show that, compared with ECAPA-TDNN, MFCA-TDNN reduces the equal error rate (EER) and the minimum detection cost function (minDCF) by 5.9% and 7.9%, respectively; the lowest EER reaches 3.83% and the lowest minDCF reaches 0.2202.

Keywords: voiceprint library, time-delay neural network, multi-scale feature extraction, frequency channel attention, feature attention fusion
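
For readers unfamiliar with the equal error rate reported in the abstract, the following is a minimal sketch of how it is commonly estimated from verification trials: sweep an accept threshold and take the point where the false acceptance rate and false rejection rate are closest. The function names, the toy scores, and the `relative_reduction` helper are illustrative assumptions, not the authors' evaluation code. The abstract does not state whether the 5.9% and 7.9% drops are relative or absolute, so the helper shows only one possible reading.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER of a verification system.

    scores: similarity scores, higher means "same speaker".
    labels: 1 for target (same-speaker) trials, 0 for impostor trials.
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    impostors = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not impostors:
        raise ValueError("need at least one target and one impostor trial")

    best_gap, eer = None, None
    # Sweep every observed score as a candidate accept threshold.
    for t in sorted(set(scores)):
        far = sum(s >= t for s in impostors) / len(impostors)  # false accepts
        frr = sum(s < t for s in targets) / len(targets)       # false rejects
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, improved):
    """Relative drop of a metric, as a fraction of the baseline value.

    Assumption: this is one reading of a "reduction"; the source does not
    specify relative versus absolute.
    """
    return (baseline - improved) / baseline


# Toy trial list (made-up numbers, not the paper's data).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
toy_eer = equal_error_rate(scores, labels)
```

This threshold sweep is quadratic in the number of trials, which is fine for illustration; evaluations with millions of trials usually sort the scores once and accumulate error counts instead.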

CLC Number: