Journal of Computer Applications, 2023, Vol. 43, Issue (10): 3217-3224. DOI: 10.11772/j.issn.1001-9081.2022101533

• Multimedia computing and computer simulation •

Double complex convolution and attention aggregating recurrent network for speech enhancement

Bennian YU1, Yongzhao ZHAN1, Qirong MAO1,2, Wenlong DONG1, Honglin LIU1

  1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China
  2. Jiangsu Province Big Data Ubiquitous Perception and Intelligent Agriculture Application Engineering Research Center, Zhenjiang, Jiangsu 212013, China
  • Received: 2022-10-12  Revised: 2022-12-24  Accepted: 2022-12-28  Online: 2023-10-07  Published: 2023-10-10
  • Contact: Yongzhao ZHAN
  • About author: YU Bennian (余本年), born in 1996, from Chizhou, Anhui, M.S. candidate. Her research interests include speech enhancement.
    MAO Qirong (毛启容), born in 1975, from Luzhou, Sichuan, Ph.D., professor, CCF member. Her research interests include pattern recognition and multimedia analysis.
    DONG Wenlong (董文龙), born in 1997, from Xuzhou, Jiangsu, Ph.D. candidate. His research interests include multimedia computing.
    LIU Honglin (刘洪麟), born in 1992, from Suqian, Jiangsu, Ph.D. candidate. His research interests include image classification of pests and diseases.
  • Supported by:
    Key Research and Development Program of Jiangsu Province (BE2020036)

Abstract:

To address the limited representation of correlation information among spectrogram features and the unsatisfactory denoising performance of existing speech enhancement methods, a speech enhancement method based on a Double Complex Convolution and Attention Aggregating Recurrent Network (DCCARN) was proposed. Firstly, a double complex convolutional network was established to encode the spectrogram features obtained by the short-time Fourier transform in two branches. Secondly, the codes of the two branches were fed to inter-feature-block and intra-feature-block attention mechanisms respectively, which re-labeled the different speech feature information. Thirdly, the long-term sequence information was processed by a Long Short-Term Memory (LSTM) network, and the spectrogram features were restored and aggregated by two decoders. Finally, the target speech waveform was generated by the inverse short-time Fourier transform, thereby suppressing noise. Experiments were carried out on the public dataset VBD (Voice Bank+DEMAND) and on the noise-added TIMIT dataset. The results show that, compared with the phase-aware Deep Complex Convolution Recurrent Network (DCCRN), DCCARN improves the Perceptual Evaluation of Speech Quality (PESQ) score by 0.150 on VBD and by 0.077 to 0.087 on the noise-added TIMIT dataset. This verifies that the proposed method captures the correlation information of spectrogram features more accurately, suppresses noise more effectively, and improves speech intelligibility.
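
The abstract does not spell out how a complex convolution is computed, so the following is only a minimal sketch of the standard construction, not the authors' implementation. It assumes a 1-D "valid" convolution in the cross-correlation form usual in CNN layers, and it builds the complex result from four real convolutions using (Xr + jXi)(Wr + jWi) = (XrWr - XiWi) + j(XrWi + XiWr). The function names, the toy spectrogram bins, and the two-tap kernel are all hypothetical and chosen purely for illustration.

```python
from typing import List


def real_conv1d(x: List[float], w: List[float]) -> List[float]:
    """Valid-mode 1-D cross-correlation of two real sequences."""
    n_out = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n_out)]


def complex_conv1d(x: List[complex], w: List[complex]) -> List[complex]:
    """Complex convolution assembled from four real convolutions.

    real part = conv(Xr, Wr) - conv(Xi, Wi)
    imag part = conv(Xr, Wi) + conv(Xi, Wr)
    """
    xr = [v.real for v in x]
    xi = [v.imag for v in x]
    wr = [v.real for v in w]
    wi = [v.imag for v in w]
    rr = real_conv1d(xr, wr)
    ii = real_conv1d(xi, wi)
    ri = real_conv1d(xr, wi)
    ir = real_conv1d(xi, wr)
    return [complex(a - b, c + d) for a, b, c, d in zip(rr, ii, ri, ir)]


def direct_complex_conv1d(x: List[complex], w: List[complex]) -> List[complex]:
    """Reference result using Python's built-in complex arithmetic."""
    n_out = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n_out)]


# Hypothetical complex STFT bins of one frame and a hypothetical 2-tap kernel.
frame = [1 + 2j, 3 - 1j, -2 + 0.5j, 0.5 + 0.5j]
kernel = [0.5 - 1j, 2 + 1j]

out = complex_conv1d(frame, kernel)
ref = direct_complex_conv1d(frame, kernel)
print(out)
```

Splitting the operation into real-valued convolutions is what lets a complex layer reuse ordinary real convolution kernels while still coupling the real and imaginary parts of the spectrogram, which carry the magnitude and phase information a phase-aware model relies on.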

Key words: speech enhancement, attention mechanism, complex convolutional network, coding, Long Short-Term Memory (LSTM) network
