Journal of Computer Applications, 2025, Vol. 45, Issue (9): 2957-2965. DOI: 10.11772/j.issn.1001-9081.2025030268

• Multimedia computing and computer simulation •

Speech enhancement network driven by complex frequency attention and multi-scale frequency enhancement

Jinggang LYU, Shaorui PENG, Shuo GAO, Jin ZHOU

  1. School of Science and Technology, Tianjin University of Finance and Economics, Tianjin 300222, China
  • Received: 2025-03-19 Revised: 2025-05-31 Accepted: 2025-06-06 Online: 2025-06-25 Published: 2025-09-10
  • Contact: Jin ZHOU
  • About author: LYU Jinggang, born in 1977, Ph. D., associate professor. His research interests include speech signal processing and speech enhancement.
    PENG Shaorui, born in 2001, M. S. candidate. His research interests include speech signal processing and speech enhancement.
    GAO Shuo, born in 2000, M. S. candidate. His research interests include recognition networks based on the collaboration of global and local features.
  • Supported by:
    Natural Science Foundation of Tianjin (22JCYBJC01550)

Abstract:

Current speech enhancement methods take the complex spectrum as the target signal, yet the training networks are usually real-valued. During training, the real and imaginary parts of the signal are processed in parallel, which reduces the accuracy of feature extraction and leaves the semantic features of the complex frequency domain insufficiently exploited. To address these issues, a complex-domain network based on Complex Frequency Attention and Multi-Scale Frequency Domain Enhancement (CFAFE) was proposed for speech enhancement on top of the U-Net architecture. Firstly, the Short-Time Fourier Transform (STFT) was used to convert the noisy speech time-series signal into the complex frequency domain. Secondly, for the complex frequency-domain features, a complex-domain multi-scale frequency enhancement module was designed, and an enhanced local-feature mining module for noisy speech in the complex frequency domain was constructed, so as to strengthen the network's ability to suppress frequency-domain interference and to recognize the features of the desired signal. Thirdly, a self-attention algorithm in the complex frequency domain was designed on the basis of ViT (Vision Transformer) to enhance the complex frequency-domain features in parallel. Finally, comparative and ablation experiments were conducted on the benchmark VoiceBank+Demand dataset, and transfer-generalization experiments were carried out on the Timit dataset mixed with Noise92 noise. Experimental results show that on the VoiceBank+Demand dataset, the proposed network outperforms the Deep Complex Convolution Recurrent Network (DCCRN) by 16.6%, 10.9%, 44.4%, and 14.1% in terms of Perceptual Evaluation of Speech Quality (PESQ), MOS prediction of the signal distortion (CSIG), MOS prediction of the intrusiveness of background noise (CBAK), and MOS prediction of the overall effect (COVL), respectively; on the Timit+Noise92 dataset, compared with DCCRN under -5 dB Signal-to-Noise Ratio (SNR) babble noise, the proposed network improves PESQ and Short-Time Objective Intelligibility (STOI) by 29.8% and 5.2%, respectively.
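To illustrate how self-attention can operate on the real and imaginary parts of an STFT jointly rather than as two independent real-valued streams, the following is a minimal sketch, not the authors' implementation: the module names (ComplexLinear, ComplexSelfAttention), the single-head design, and the magnitude-based softmax are all assumptions for illustration, while the complex arithmetic follows the convention used by complex networks such as DCCRN.

```python
# Minimal sketch of complex frequency-domain self-attention (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ComplexLinear(nn.Module):
    """Complex linear map: (Wr + jWi)(xr + jxi) = (Wr xr - Wi xi) + j(Wr xi + Wi xr)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.wr = nn.Linear(in_dim, out_dim)
        self.wi = nn.Linear(in_dim, out_dim)

    def forward(self, xr, xi):
        return self.wr(xr) - self.wi(xi), self.wr(xi) + self.wi(xr)


class ComplexSelfAttention(nn.Module):
    """Single-head self-attention over complex spectra; the attention weights
    are computed from the magnitude of the complex similarity Q K^H."""
    def __init__(self, d_model):
        super().__init__()
        self.q = ComplexLinear(d_model, d_model)
        self.k = ComplexLinear(d_model, d_model)
        self.v = ComplexLinear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, xr, xi):                       # shapes: (batch, frames, freq_bins)
        qr, qi = self.q(xr, xi)
        kr, ki = self.k(xr, xi)
        vr, vi = self.v(xr, xi)
        # complex similarity (qr + j qi)(kr - j ki)^T
        sim_r = qr @ kr.transpose(-1, -2) + qi @ ki.transpose(-1, -2)
        sim_i = qi @ kr.transpose(-1, -2) - qr @ ki.transpose(-1, -2)
        attn = F.softmax(torch.sqrt(sim_r ** 2 + sim_i ** 2 + 1e-8) * self.scale, dim=-1)
        return attn @ vr, attn @ vi                  # enhanced real / imaginary parts


# Toy usage: STFT of a random 1 s "waveform" at 16 kHz, then complex attention.
wave = torch.randn(1, 16000)
spec = torch.stft(wave, n_fft=512, hop_length=256,
                  window=torch.hann_window(512), return_complex=True)   # (1, 257, frames)
xr, xi = spec.real.transpose(1, 2), spec.imag.transpose(1, 2)           # (1, frames, 257)
out_r, out_i = ComplexSelfAttention(d_model=257)(xr, xi)
print(out_r.shape, out_i.shape)
```

In this sketch the attention map is shared by the real and imaginary branches, so both parts are re-weighted coherently; how the actual CFAFE network forms and applies its complex attention may differ.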

Key words: speech enhancement, complex neural network, U-Net, attention mechanism, Transformer

