Journal of Computer Applications, 2024, Vol. 44, Issue (8): 2407-2413. DOI: 10.11772/j.issn.1001-9081.2023081276

• Artificial intelligence •

Construction method of voiceprint library based on multi-scale frequency-channel attention fusion

Tong CHEN, Fengyu YANG, Yu XIONG, Hong YAN, Fuxing QIU

  1. School of Software, Nanchang Hangkong University, Nanchang Jiangxi 330063, China
  • Received: 2023-09-18 Revised: 2023-09-26 Accepted: 2023-10-09 Online: 2024-08-22 Published: 2024-08-10
  • Contact: Fengyu YANG
  • About author: CHEN Tong, born in 2002, M. S. candidate. Her research interests include trusted artificial intelligence.
    YANG Fengyu, born in 1980, M. S., associate professor. His research interests include trusted artificial intelligence, intelligent software testing.
    XIONG Yu, born in 1985, Ph. D., lecturer. His research interests include social media mining, multi-modal data fusion.
    YAN Hong, born in 1999, M. S. candidate. Her research interests include trusted artificial intelligence.
    QIU Fuxing, born in 1998, M. S. candidate. His research interests include software defect prediction, intelligent software testing.
  • Supported by:
    This work is partially supported by National Natural Science Foundation of China (61762067).

基于多尺度频率通道注意力融合的声纹库构建方法

陈彤, 杨丰玉, 熊宇, 严荭, 邱福星

  1. 南昌航空大学 软件学院,南昌 330063
  • 通讯作者: 杨丰玉
  • 作者简介:陈彤(2002—),女,江西吉安人,硕士研究生,CCF会员,主要研究方向:可信人工智能
    杨丰玉(1980—),男,江西九江人,副教授,硕士,CCF会员,主要研究方向:可信人工智能、智能化软件测试 99770277@qq.com
    熊宇(1985—),男,江西南昌人,讲师,博士,主要研究方向:社会媒体挖掘、多模态数据融合
    严荭(1999—),女,江西上饶人,硕士研究生,CCF会员,主要研究方向:可信人工智能
    邱福星(1998—),男,江西赣州人,硕士研究生,主要研究方向:软件缺陷预测、智能化软件测试。
  • 基金资助:
    国家自然科学基金资助项目(61762067)

Abstract:

Speaker verification accuracy is easily degraded by external factors. To address this, a speaker verification algorithm based on a Multi-scale Frequency-Channel Attention fused Time-Delay Neural Network (MFCA-TDNN) model was proposed. MFCA-TDNN makes three improvements to ECAPA-TDNN (Emphasized Channel Attention Propagation Aggregation Time Delay Neural Network): a multi-scale frequency-channel attention front-end that obtains high-resolution feature representations from speech; a multi-scale channel attention module that fuses multi-scale information by combining local and global features; and a feature attention fusion module that weights the fused multi-scale features. These improvements enable the model to make better use of multi-scale time-frequency information and improve its recognition capability. Experimental results show that, compared with the ECAPA-TDNN model, the MFCA-TDNN model reduces the Equal Error Rate (EER) and the minimum Detection Cost Function (minDCF) by 5.9% and 7.9%, respectively, reaching a lowest EER of 3.83% and a lowest minDCF of 0.2202.
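The two metrics reported above can be illustrated with a minimal sketch. It assumes a trial is accepted when its score is at or above a threshold, and it uses a target prior of 0.01 with unit costs for minDCF; these settings, the function names, and the toy scores are illustrative assumptions, not values taken from the paper.

```python
def error_rates(target_scores, nontarget_scores):
    """Return (threshold, miss_rate, false_alarm_rate) for each candidate threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(thresholds[-1] + 1.0)  # extra threshold that rejects every trial
    rates = []
    for t in thresholds:
        # Assumption: a trial is accepted when its score is >= the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        rates.append((t, miss, false_alarm))
    return rates


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER at the threshold where miss and false-alarm rates are closest."""
    _, miss, false_alarm = min(
        error_rates(target_scores, nontarget_scores),
        key=lambda r: abs(r[1] - r[2]),
    )
    return (miss + false_alarm) / 2


def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all candidate thresholds.

    p_target, c_miss and c_fa are assumed defaults, not settings from the paper.
    """
    default_cost = min(c_miss * p_target, c_fa * (1 - p_target))
    return min(
        (c_miss * p_target * miss + c_fa * (1 - p_target) * false_alarm) / default_cost
        for _, miss, false_alarm in error_rates(target_scores, nontarget_scores)
    )


# Toy scores invented for illustration only (not data from the paper).
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))
print(min_dcf(targets, nontargets))
```

Lower values are better for both metrics: EER is the operating point where false acceptances and false rejections are equally frequent, while minDCF weights the two error types by the assumed prior and costs.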

Key words: voiceprint library, time-delay neural network, multi-scale feature extraction, frequency-channel attention, feature attention fusion

摘要:

为解决声纹识别准确性易受外部因素影响的问题,提出一种基于多尺度频率通道注意力融合时延神经网络(MFCA-TDNN)模型的声纹识别算法。MFCA-TDNN在ECAPA-TDNN(Emphasized Channel Attention Propagation Aggregation Time Delay Neural Network)的基础上作了3点改进,包括:加入了多尺度频率通道注意力前端以从话语中获得高分辨率的特征表示、添加了多尺度通道注意力模块结合局部和全局的特征以融合多尺度信息、嵌入了特征注意力融合模块为多尺度的融合特征加权。这些改进使模型更好地利用多尺度的时频信息,提高识别能力。实验结果表明,与ECAPA-TDNN模型相比,MFCA-TDNN模型等错误率(EER)和最小检测代价函数(minDCF)分别下降5.9%和7.9%;最低的EER可达到3.83%,最低的minDCF可达到0.220 2。

关键词: 声纹库, 时延神经网络, 多尺度特征提取, 频率通道注意力, 特征注意力融合
