Speech enhancement network driven by complex frequency attention and multi-scale frequency enhancement
Jinggang LYU, Shaorui PENG, Shuo GAO, Jin ZHOU
Journal of Computer Applications, 2025, 45(9): 2957-2965. DOI: 10.11772/j.issn.1001-9081.2025030268

Current speech enhancement methods take complex spectrum signals as the target signals, yet the networks used for training are usually real-valued. During training, the real and imaginary parts of the signals are processed in parallel, which reduces the accuracy of feature extraction and leaves semantic features in the complex frequency domain insufficiently extracted. To address these issues, a complex-domain network based on Complex Frequency Attention and Multi-Scale Frequency Domain Enhancement (CFAFE) was proposed for speech enhancement, built on the U-Net architecture. Firstly, the Short-Time Fourier Transform (STFT) was used to convert the noisy speech time-series signal into the complex frequency domain. Secondly, for the complex frequency domain features, a complex-domain multi-scale frequency enhancement module was designed, and a local feature mining module for noisy speech in the complex frequency domain was constructed, strengthening the network's abilities to suppress frequency-domain interference and to recognize the features of the expected signal. Thirdly, a self-attention algorithm in the complex frequency domain was designed on the basis of ViT (Vision Transformer), achieving parallel enhancement of complex frequency domain features. Finally, comparative and ablation experiments were conducted on the VoiceBank+Demand benchmark dataset, and transfer generalization experiments were carried out on the Timit dataset with added Noise92 noise. Experimental results show that on the VoiceBank+Demand dataset, the proposed network outperforms the Deep Complex Convolution Recurrent Network (DCCRN) by 16.6%, 10.9%, 44.4%, and 14.1% in Perceptual Evaluation of Speech Quality (PESQ), MOS prediction of the signal distortion (CSIG), MOS prediction of the intrusiveness of background noise (CBAK), and MOS prediction of the overall effect (COVL), respectively; on the Timit+Noise92 dataset, under -5 dB Signal-to-Noise Ratio (SNR) babble noise, the proposed network improves PESQ and Short-Time Objective Intelligibility (STOI) over DCCRN by 29.8% and 5.2%, respectively.
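To make the pipeline concrete, the sketch below illustrates the two ideas the abstract names: converting a noisy waveform to the complex frequency domain with an STFT, and applying self-attention over the frequency bins of the complex spectrogram, with the real and imaginary parts embedded jointly rather than processed in two parallel real-valued paths. This is a minimal sketch, not the authors' CFAFE implementation; the module name, dimensions, and hyperparameters are illustrative assumptions, written in PyTorch.

```python
# Illustrative sketch only, not the paper's CFAFE network. Shows (a) STFT to the
# complex frequency domain and (b) self-attention across frequency bins in which
# the real and imaginary parts are embedded jointly, not as parallel real paths.
import torch
import torch.nn as nn

class ComplexFrequencyAttention(nn.Module):
    """Hypothetical attention over the frequency bins of a complex spectrogram."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(2, d_model)   # (real, imag) pair -> joint embedding
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, 2)    # joint embedding -> (real, imag)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: complex tensor of shape (batch, freq, time)
        b, f, t = spec.shape
        x = torch.stack([spec.real, spec.imag], dim=-1)  # (b, f, t, 2)
        x = x.permute(0, 2, 1, 3).reshape(b * t, f, 2)   # one sequence per frame
        h = self.embed(x)
        h, _ = self.attn(h, h, h)                        # attend across frequency
        y = self.proj(h).reshape(b, t, f, 2).permute(0, 2, 1, 3)
        return torch.complex(y[..., 0], y[..., 1])

n_fft, hop = 512, 128
window = torch.hann_window(n_fft)
wave = torch.randn(1, 16000)                             # stand-in for 1 s of noisy speech
spec = torch.stft(wave, n_fft, hop, window=window, return_complex=True)
enhanced = ComplexFrequencyAttention()(spec)             # (1, 257, frames), complex
out = torch.istft(enhanced, n_fft, hop, window=window)   # back to the time domain
```

Attending across frequency for each frame mirrors the abstract's emphasis on frequency-domain enhancement; the full model would additionally require the multi-scale frequency enhancement branches and the surrounding U-Net encoder-decoder, which are omitted here.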
