Construction method of voiceprint library based on multi-scale frequency-channel attention fusion

Journal of Computer Applications, 2024, Vol. 44, Issue (8): 2407-2413. DOI: 10.11772/j.issn.1001-9081.2023081276

Tong CHEN, Fengyu YANG, Yu XIONG, Hong YAN, Fuxing QIU

School of Software, Nanchang Hangkong University, Nanchang Jiangxi 330063, China

Received: 2023-09-18 Revised: 2023-09-26 Accepted: 2023-10-09 Online: 2024-08-22 Published: 2024-08-10
Contact: Fengyu YANG
About author:
CHEN Tong, born in 2002 in Ji'an, Jiangxi, M. S. candidate, CCF member. Her research interests include trusted artificial intelligence.
YANG Fengyu, born in 1980 in Jiujiang, Jiangxi, M. S., associate professor, CCF member. His research interests include trusted artificial intelligence and intelligent software testing. E-mail: 99770277@qq.com
XIONG Yu, born in 1985 in Nanchang, Jiangxi, Ph. D., lecturer. His research interests include social media mining and multi-modal data fusion.
YAN Hong, born in 1999 in Shangrao, Jiangxi, M. S. candidate, CCF member. Her research interests include trusted artificial intelligence.
QIU Fuxing, born in 1998 in Ganzhou, Jiangxi, M. S. candidate. His research interests include software defect prediction and intelligent software testing.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61762067).

Abstract:

To address the problem that the accuracy of speaker verification is easily affected by external factors, a speaker verification algorithm based on a Multi-scale Frequency-Channel Attention fused Time-Delay Neural Network (MFCA-TDNN) model was proposed. MFCA-TDNN makes three improvements on the basis of ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation Time-Delay Neural Network): a multi-scale frequency-channel attention front-end was incorporated to obtain high-resolution feature representations from utterances; a multi-scale channel attention module was added to fuse multi-scale information by combining local and global features; and a feature attention fusion module was embedded to weight the fused multi-scale features. These improvements enable the model to make better use of multi-scale time-frequency information and improve its recognition capability. Experimental results show that, compared with the ECAPA-TDNN model, the MFCA-TDNN model reduces the Equal Error Rate (EER) and the minimum Detection Cost Function (minDCF) by 5.9% and 7.9%, respectively, with the lowest EER reaching 3.83% and the lowest minDCF reaching 0.2202.

Key words: voiceprint library, time-delay neural network, multi-scale feature extraction, frequency-channel attention, feature attention fusion

CLC Number: TP183

Cite this article:

Tong CHEN, Fengyu YANG, Yu XIONG, Hong YAN, Fuxing QIU. Construction method of voiceprint library based on multi-scale frequency-channel attention fusion[J]. Journal of Computer Applications, 2024, 44(8): 2407-2413.

Figures/Tables (11)

Fig. 1 Architecture of MFCA-TDNN model

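Tab. 1 Layer structure of the multi-scale frequency-channel attention (MFA) module

Layer	Brief description	Output
Input Layer	Fbank features	(B, D, T)
Conv2D	K=3, S=1, ReLU, BN	(B, Cmfa, D, T)
DMBlock1	Split the input features evenly into scale blocks along the channel dimension Cmfa, then convolve each block (k=3) over the (frequency, channel) dimensions	(B, Cmfa/scale, D, T)
DMBlock2	Compute attention weights for each block over the (frequency, channel) dimensions, flatten the 2D result to 1D and convolve it (k=3), then concatenate the results along the channel dimension	(B, Cmfa*D, T)
Conv1D	Transform the channel dimension to the number of channels C required by ECAPA-TDNN	(B, C, T)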
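The shape bookkeeping in Tab. 1 can be traced with a small helper. This is only a sketch of the output shapes listed in the table, not the layers themselves; the function name `mfa_output_shapes` and the example sizes (batch 8, 80 Fbank bins, 200 frames, Cmfa = 32, scale = 4, C = 512) are illustrative assumptions and do not come from the paper.

```python
def mfa_output_shapes(batch, n_freq, frames, c_mfa, scale, c_out):
    """Return (layer, output shape) pairs following the rows of Tab. 1.

    batch  -> B, n_freq -> D, frames -> T, c_mfa -> Cmfa, c_out -> C.
    All concrete sizes passed in are hypothetical.
    """
    if c_mfa % scale != 0:
        raise ValueError("Cmfa must be divisible by scale for an even split")
    return [
        ("Input Layer", (batch, n_freq, frames)),
        ("Conv2D", (batch, c_mfa, n_freq, frames)),
        # Shape of each of the `scale` channel blocks after the split.
        ("DMBlock1", (batch, c_mfa // scale, n_freq, frames)),
        # Blocks are concatenated and (channel, frequency) is flattened.
        ("DMBlock2", (batch, c_mfa * n_freq, frames)),
        ("Conv1D", (batch, c_out, frames)),
    ]


if __name__ == "__main__":
    for layer, shape in mfa_output_shapes(8, 80, 200, 32, 4, 512):
        print(f"{layer:12s} {shape}")
```

The DMBlock1 entry is the shape of a single block; there are `scale` such blocks before they are merged in DMBlock2.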

Fig. 2 SE-Res2Block module optimized with MS-CAM

Fig. 3 SE-Res2Block module optimized with AFF
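As a rough illustration of the attention-weighted fusion that the abstract attributes to the feature attention fusion module, the sketch below blends two feature vectors as `w * x + (1 - w) * y`, with the weight `w` computed from their sum. This follows the general attentional feature fusion formulation; `toy_attention` is a hypothetical stand-in (local term plus global mean, passed through a sigmoid) with no learned parameters, and it is not the MS-CAM module or the authors' implementation.

```python
import math


def sigmoid(v):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def toy_attention(feat):
    """Hypothetical attention map: each element (local context) plus the
    mean of all elements (global context), squashed by a sigmoid."""
    global_ctx = sum(feat) / len(feat)
    return [sigmoid(v + global_ctx) for v in feat]


def aff_fuse(x, y):
    """Fuse two equally sized feature vectors with attention weights.

    The weights are computed from the element-wise sum x + y, then used
    as a soft selection between x and y.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    weights = toy_attention([a + b for a, b in zip(x, y)])
    return [w * a + (1.0 - w) * b for w, a, b in zip(weights, x, y)]


if __name__ == "__main__":
    print(aff_fuse([1.0, -2.0, 0.5], [0.0, 3.0, 0.5]))
```

Because each weight lies in (0, 1), every fused value stays between the two inputs at that position, which is the property that makes this a weighting rather than a plain sum.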

Fig. 4 Overall flow of recognition method based on MFCA-TDNN

Fig. 5 Training loss comparison of four models

Fig. 6 EER and minDCF comparison of four models on validation set
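For readers unfamiliar with the two metrics, the following is a minimal sketch of how EER and a normalized minDCF can be computed from trial scores by sweeping a decision threshold. The cost parameters (`p_target=0.05`, `c_miss=1.0`, `c_fa=1.0`) and the normalization are assumptions chosen for illustration; the settings used in the paper are not stated on this page.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Return (false acceptance rate, false rejection rate) at a threshold.

    A trial is accepted when its score is >= threshold.
    """
    frr = sum(s < threshold for s in target_scores) / len(target_scores)
    far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return far, frr


def compute_eer(target_scores, nontarget_scores):
    """Approximate the EER as the mean of FAR and FRR at the threshold
    where the two rates are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        far, frr = error_rates(target_scores, nontarget_scores, t)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


def compute_min_dcf(target_scores, nontarget_scores,
                    p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all thresholds
    (cost parameters are illustrative assumptions)."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(float("inf"))  # the reject-everything operating point
    costs = []
    for t in thresholds:
        far, frr = error_rates(target_scores, nontarget_scores, t)
        costs.append(c_miss * frr * p_target + c_fa * far * (1.0 - p_target))
    return min(costs) / min(c_miss * p_target, c_fa * (1.0 - p_target))


if __name__ == "__main__":
    targets = [0.9, 0.4]      # scores of same-speaker trials (made up)
    nontargets = [0.6, 0.1]   # scores of different-speaker trials (made up)
    print(compute_eer(targets, nontargets))
    print(compute_min_dcf(targets, nontargets))
```

Production toolkits usually interpolate between thresholds; the nearest-crossing approximation here is enough to show what the two numbers measure.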

Tab. 2 Comparison of best performance among four models

Model	Parameters/10⁶	Time/s	EER/%	minDCF
RawNet3	5.83	213	4.38	0.2465
ResNet34	5.36	174	4.42	0.2568
ECAPA-TDNN	6.19	298	4.19	0.2391
MFCA-TDNN	8.92	402	3.94	0.2202
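The 5.9% and 7.9% drops quoted in the abstract appear to be relative reductions between the ECAPA-TDNN and MFCA-TDNN rows of Tab. 2. The short check below computes them under that assumption; `relative_reduction` is an illustrative helper, not something defined in the paper.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0


if __name__ == "__main__":
    # Values copied from the ECAPA-TDNN and MFCA-TDNN rows of Tab. 2.
    eer_drop = relative_reduction(4.19, 3.94)
    dcf_drop = relative_reduction(0.2391, 0.2202)
    print(f"EER reduction:    {eer_drop:.2f}%")
    print(f"minDCF reduction: {dcf_drop:.2f}%")
```

Both results fall within a tenth of a percentage point of the figures in the abstract, which supports reading those figures as relative rather than absolute changes.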

Tab. 3 Effect of data augmentation

Data augmentation	EER/%	minDCF
No	4.63	0.2601
Yes	4.19	0.2391

Tab. 4 Results of ablation experiments

Model	Parameters/10⁶	EER/%	minDCF
MFCA-TDNN	8.92	3.94	0.2202
+MFA	8.15	3.84	0.2241
+MS_CAM	7.42	3.83	0.2236
+AFF	6.42	4.02	0.2279

Fig. 7 Effect comparison of different numbers of speakers

