Journal of Computer Applications
周云龙1,2,陈德富3,刘小湖1,桑伊健2,周晗昀2
Abstract: The Transformer based on the self-attention mechanism has demonstrated superior performance in most Natural Language Processing (NLP) tasks. However, previous studies have indicated that the Transformer is not highly competitive when applied to speaker verification, particularly in terms of local modeling capability, lightweight structure, and real-time inference. To address these issues, an end-to-end speaker verification model based on an improved Transformer, Deep Treatment Fusion-Transformer (DTF-Transformer), was proposed, with improvements in three aspects. First, a simplified multi-scale attention mechanism was employed in place of multi-head attention to enhance the model's local modeling ability and reduce the number of parameters. Second, a lightweight Feed-Forward Network (FFN) was designed to further reduce parameters and accelerate inference. Finally, a fusion mechanism was applied to features at different depths to improve the model's ability to represent and generalize deep features. Experimental results on the VoxCeleb and CN-Celeb public benchmark datasets demonstrate that, compared with the popular ResNet34 and ECAPA-TDNN networks, DTF-Transformer reduces the Equal Error Rate (EER) by 14% and 23% on the VoxCeleb-O test set, and by 14% and 15% on the CN-Celeb(E) test set, respectively. Moreover, DTF-Transformer is more lightweight and offers faster inference without sacrificing accuracy.
Key words: speaker verification, speaker embedding, Transformer, self-attention mechanism, feature fusion
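The abstract names three architectural changes: multi-scale attention in place of multi-head attention, a lightweight FFN, and fusion of features taken at different encoder depths. The PyTorch sketch below shows one way these ideas could fit together; it is an assumption-laden illustration, not the authors' implementation. In particular, realizing the multi-scale attention with parallel depthwise convolutions, using an FFN expansion ratio of 2, and fusing block outputs with learned softmax weights are guesses made only to keep the example concrete and runnable.

```python
# Hypothetical sketch of the three ideas named in the abstract (simplified
# multi-scale attention, lightweight FFN, depth-wise feature fusion).
# Module names, kernel sizes, and dimensions are illustrative assumptions,
# not the authors' actual DTF-Transformer.
import torch
import torch.nn as nn


class MultiScaleLocalMixer(nn.Module):
    """Stand-in for the 'simplified multi-scale attention': parallel depthwise
    convolutions with different kernel sizes capture local context at several
    scales; a pointwise convolution then mixes channels (assumption)."""
    def __init__(self, dim, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim) for k in kernel_sizes
        )
        self.proj = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                     # x: (batch, time, dim)
        y = x.transpose(1, 2)                 # -> (batch, dim, time) for Conv1d
        y = sum(b(y) for b in self.branches) / len(self.branches)
        return self.proj(y).transpose(1, 2)   # back to (batch, time, dim)


class LightweightFFN(nn.Module):
    """Feed-forward block with a small expansion ratio to cut parameters
    (ratio 2 is an assumption; the vanilla Transformer uses 4)."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * expansion), nn.ReLU(), nn.Linear(dim * expansion, dim)
        )

    def forward(self, x):
        return self.net(x)


class EncoderBlock(nn.Module):
    """Pre-norm residual block: local mixing followed by the lightweight FFN."""
    def __init__(self, dim):
        super().__init__()
        self.mixer, self.ffn = MultiScaleLocalMixer(dim), LightweightFFN(dim)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        return x + self.ffn(self.norm2(x))


class DTFTransformerSketch(nn.Module):
    """Stacks encoder blocks and fuses the output of every depth with learned
    softmax weights before average pooling into a fixed-size speaker embedding."""
    def __init__(self, feat_dim=80, dim=256, depth=6, emb_dim=192):
        super().__init__()
        self.stem = nn.Linear(feat_dim, dim)
        self.blocks = nn.ModuleList(EncoderBlock(dim) for _ in range(depth))
        self.fusion_weights = nn.Parameter(torch.zeros(depth))
        self.head = nn.Linear(dim, emb_dim)

    def forward(self, feats):                 # feats: (batch, frames, feat_dim), e.g. Fbank
        x, taps = self.stem(feats), []
        for block in self.blocks:
            x = block(x)
            taps.append(x)                    # keep features from every depth
        w = torch.softmax(self.fusion_weights, dim=0)
        fused = sum(wi * t for wi, t in zip(w, taps))
        return self.head(fused.mean(dim=1))   # utterance-level speaker embedding


if __name__ == "__main__":
    model = DTFTransformerSketch()
    emb = model(torch.randn(2, 200, 80))      # 2 utterances, 200 frames of 80-dim Fbank
    print(emb.shape)                          # torch.Size([2, 192])
```

Fusing block outputs with a learned weighted sum is only one plausible reading of the "deep treatment fusion" idea; concatenating the per-depth features and projecting them, or fusing only a subset of depths, would be equally reasonable given the abstract alone.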
CLC Number: TN912.34; TP391.42
周云龙, 陈德富, 刘小湖, 桑伊健, 周晗昀. End-to-end speaker verification model based on improved Transformer [J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2024071044.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024071044