Construction method of voiceprint library based on multi-scale frequency-channel attention fusion

doi:10.11772/j.issn.1001-9081.2023081276

Journal of Computer Applications, 2024, Vol. 44, Issue 8: 2407-2413. DOI: 10.11772/j.issn.1001-9081.2023081276

• Artificial intelligence •

Tong CHEN, Fengyu YANG, Yu XIONG, Hong YAN, Fuxing QIU

School of Software, Nanchang Hangkong University, Nanchang Jiangxi 330063, China

Received: 2023-09-18  Revised: 2023-09-26  Accepted: 2023-10-09  Online: 2024-08-22  Published: 2024-08-10
Contact: Fengyu YANG
About author: CHEN Tong, born in 2002, M. S. candidate. Her research interests include trusted artificial intelligence.
YANG Fengyu, born in 1980, M. S., associate professor. His research interests include trusted artificial intelligence, intelligent software testing.
XIONG Yu, born in 1985, Ph. D., lecturer. His research interests include social media mining, multi-modal data fusion.
YAN Hong, born in 1999, M. S. candidate. Her research interests include trusted artificial intelligence.
QIU Fuxing, born in 1998, M. S. candidate. His research interests include software defect prediction, intelligent software testing.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61762067).

Chinese title: 基于多尺度频率通道注意力融合的声纹库构建方法

Chinese author names: 陈彤 (Tong CHEN), 杨丰玉 (Fengyu YANG), 熊宇 (Yu XIONG), 严荭 (Hong YAN), 邱福星 (Fuxing QIU)

Additional author details from the Chinese-language record:
CHEN Tong, from Ji'an, Jiangxi; CCF member.
YANG Fengyu, from Jiujiang, Jiangxi; CCF member; e-mail: 99770277@qq.com
XIONG Yu, from Nanchang, Jiangxi.
YAN Hong, from Shangrao, Jiangxi; CCF member.
QIU Fuxing, from Ganzhou, Jiangxi.

Abstract

To address the problem that speaker verification accuracy is easily affected by external factors, a speaker verification algorithm based on a Multi-scale Frequency-Channel Attention fused Time-Delay Neural Network (MFCA-TDNN) model is proposed. MFCA-TDNN makes three improvements to ECAPA-TDNN (Emphasized Channel Attention Propagation Aggregation Time Delay Neural Network): a multi-scale frequency-channel attention front-end is incorporated to obtain high-resolution feature representations from speech; a multi-scale channel attention module is added to fuse multi-scale information by combining local and global features; and a feature attention fusion module is embedded to weight the fused multi-scale features. These improvements enable the model to make better use of multi-scale time-frequency information and improve its recognition capability. Experimental results show that, compared with the ECAPA-TDNN model, the MFCA-TDNN model reduces the Equal Error Rate (EER) and the minimum Detection Cost Function (minDCF) by 5.9% and 7.9%, respectively, in relative terms; the lowest EER reached is 3.83% and the lowest minDCF is 0.2202.
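The idea of weighting fused features by combining local and global context can be illustrated with a minimal sketch. This is not the authors' module: the function names are hypothetical, the learned pointwise convolutions that a real attention module would contain are replaced by the identity, and features are plain nested lists indexed as [channel][time]. It only shows the general pattern of computing a sigmoid weight from local plus global context and using it to blend two feature maps.

```python
import math


def sigmoid(v):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def attention_weights(feat):
    """Per-element weights from local + global context.

    feat: list of channels, each a list of time steps.
    Assumption: learned transforms are omitted (identity), so the local
    context is the element itself and the global context is the
    time-average of its channel.
    """
    weights = []
    for channel in feat:
        global_ctx = sum(channel) / len(channel)
        weights.append([sigmoid(local + global_ctx) for local in channel])
    return weights


def attentional_fusion(x, y):
    """Blend x and y element-wise as w * x + (1 - w) * y.

    The weight w is computed from the element-wise sum of x and y.
    """
    summed = [[a + b for a, b in zip(cx, cy)] for cx, cy in zip(x, y)]
    w = attention_weights(summed)
    return [
        [wi * a + (1.0 - wi) * b for wi, a, b in zip(cw, cx, cy)]
        for cw, cx, cy in zip(w, x, y)
    ]


# Toy inputs: 2 channels, 2 time steps each (values are arbitrary).
x = [[1.0, 2.0], [0.0, -1.0]]
y = [[0.0, 0.5], [1.0, 1.0]]
fused = attentional_fusion(x, y)
```

Because each weight lies strictly between 0 and 1, every fused value falls between the corresponding values of the two inputs, and fusing a feature map with itself returns it unchanged.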

Key words: voiceprint library, time-delay neural network, multi-scale feature extraction, frequency-channel attention, feature attention fusion


CLC Number: TP183

Tong CHEN, Fengyu YANG, Yu XIONG, Hong YAN, Fuxing QIU. Construction method of voiceprint library based on multi-scale frequency-channel attention fusion[J]. Journal of Computer Applications, 2024, 44(8): 2407-2413.



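Layer	Brief description	Output
Input Layer	Fbank features	(B, D, T)
Conv2D	K=3, S=1, ReLU, BN	(B, Cmfa, D, T)
DMBlock1	Split the input features evenly into scale groups along the channel dimension Cmfa, then convolve each group over the (frequency, channel) dimensions (k=3)	(B, Cmfa/scale, D, T)
DMBlock2	Compute attention weights for each group over the (frequency, channel) dimensions, flatten the 2D result to 1D and convolve it (k=3), then concatenate the results along the channel dimension	(B, Cmfa*D, T)
Conv1D	Project the channel dimension to the channel count C required by ECAPA-TDNN	(B, C, T)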
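The output shapes listed in the table above can be traced with a small helper. This is only a bookkeeping sketch: the function name and the example sizes (batch 8, 80 Fbank bins, 200 frames, Cmfa = 32, scale = 4, C = 512) are assumptions chosen for illustration, not values taken from the paper, and no actual convolution or attention is computed.

```python
def mfa_frontend_shapes(B, D, T, C_mfa, scale, C):
    """Return (layer name, output shape) pairs following the table.

    B: batch size, D: number of Fbank bins, T: number of frames,
    C_mfa: channels after Conv2D, scale: number of channel groups,
    C: channel count expected by the downstream ECAPA-TDNN.
    """
    if C_mfa % scale != 0:
        raise ValueError("C_mfa must be divisible by scale")
    return [
        ("Input Layer", (B, D, T)),
        ("Conv2D", (B, C_mfa, D, T)),
        # Shape of each of the `scale` channel groups.
        ("DMBlock1", (B, C_mfa // scale, D, T)),
        # Groups merged again, then channel and frequency axes flattened.
        ("DMBlock2", (B, C_mfa * D, T)),
        ("Conv1D", (B, C, T)),
    ]


# Hypothetical sizes, for illustration only.
for name, shape in mfa_frontend_shapes(B=8, D=80, T=200, C_mfa=32, scale=4, C=512):
    print(f"{name:12s} {shape}")
```

Note that the DMBlock2 output folds the frequency axis into the channel axis (Cmfa*D), which is what lets the final Conv1D treat the result as an ordinary (batch, channel, time) input.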


Model	Parameters/10⁶	Time/s	EER/%	minDCF
RawNet3	5.83	213	4.38	0.2465
ResNet34	5.36	174	4.42	0.2568
ECAPA-TDNN	6.19	298	4.19	0.2391
MFCA-TDNN	8.92	402	3.94	0.2202
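As a quick check, the relative improvements reported in the abstract can be recomputed from the ECAPA-TDNN and MFCA-TDNN rows of the table above. The helper name below is hypothetical; the formula is the ordinary relative reduction, (baseline - improved) / baseline.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of a lower-is-better metric, in percent."""
    return (baseline - improved) / baseline * 100.0


# Values copied from the ECAPA-TDNN and MFCA-TDNN rows.
eer_reduction = relative_reduction(4.19, 3.94)
mindcf_reduction = relative_reduction(0.2391, 0.2202)

print(f"EER reduction:    {eer_reduction:.2f}%")
print(f"minDCF reduction: {mindcf_reduction:.2f}%")
```

This gives roughly 5.97% for EER and 7.90% for minDCF, which agrees with the 5.9% and 7.9% quoted in the abstract if those figures were truncated rather than rounded.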


Data augmentation	EER/%	minDCF
No	4.63	0.2601
Yes	4.19	0.2391


Model	Parameters/10⁶	EER/%	minDCF
MFCA-TDNN	8.92	3.94	0.2202
+MFA	8.15	3.84	0.2241
+MS_CAM	7.42	3.83	0.2236
+AFF	6.42	4.02	0.2279
