Journal of Computer Applications (《计算机应用》), 2024, Vol. 44, Issue (4): 1317-1324. DOI: 10.11772/j.issn.1001-9081.2023040452

• Multimedia computing and computer simulation •

基于门控膨胀卷积循环网络的单声道语音增强 (Monaural speech enhancement based on gated dilated convolutional recurrent network)

尤昕源, 王恒

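  1. School of Mathematics and Computer Science, Wuhan Polytechnic University, Wuhan 430048
  • Received: 2023-04-21 Revised: 2023-07-06 Accepted: 2023-07-10 Online: 2023-12-04 Published: 2024-04-10
  • Corresponding author: WANG Heng (王恒)
  • About the authors: YOU Xinyuan (尤昕源), born in 1998, female, from Luoyang, Henan, M. S. candidate, CCF member. Her research interests include monaural speech enhancement;
    WANG Heng (王恒), born in 1983, male, from Wuhan, Hubei, associate professor, Ph. D. His research interests include perceptual features of acoustic spatial parameters, artificial intelligence, and applications of 3D audio and video in virtual reality. wh825554@163.com
  • Supported by:
    Key Project of Scientific Research Plan of Hubei Provincial Department of Education (D20201601); Open Fund of Hubei Key Laboratory of Intelligent Robot (Wuhan Institute of Technology) (HBIR202101)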

Monaural speech enhancement based on gated dilated convolutional recurrent network

Xinyuan YOU, Heng WANG

  1. School of Mathematics & Computer Science, Wuhan Polytechnic University, Wuhan Hubei 430048, China
  • Received: 2023-04-21 Revised: 2023-07-06 Accepted: 2023-07-10 Online: 2023-12-04 Published: 2024-04-10
  • Contact: Heng WANG
  • About author:YOU Xinyuan, born in 1998, M. S. candidate. Her research interests include monaural speech enhancement.
  • Supported by:
    Key Project of Scientific Research Plan of Hubei Provincial Department of Education (D20201601); Hubei Key Laboratory of Intelligent Robot (Wuhan Institute of Technology) Open Fund (HBIR202101)

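Abstract:

The use of contextual information plays an important role in speech enhancement. To address the insufficient use of global speech information, a Gated Dilated Convolutional Recurrent Network (GDCRN) for complex spectral mapping is proposed. GDCRN consists of three parts: an encoder, a Gated Temporal Convolution Module (GTCM), and a decoder; the encoder and decoder have asymmetric structures. First, the encoder uses a Gated Dilated Convolution Module (GDCM) to enlarge the receptive field and process the features. Second, the GTCM captures longer contextual information and selectively passes features on. Finally, the decoder uses deconvolutions combined with Gated Linear Units (GLU), with skip connections between each deconvolution layer and the corresponding convolution layer in the encoder, and introduces a Channel Time-Frequency Attention (CTFA) mechanism. Experimental results show that, compared with networks such as the Temporal Convolutional Neural Network (TCNN) and the Gated Convolutional Recurrent Network (GCRN), the proposed network has fewer parameters and a shorter training time, and significantly improves Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI), by up to 0.2589 and 4.67 percentage points respectively, showing better enhancement performance and stronger generalization ability.

Key words: speech enhancement, complex spectral mapping, dilated convolution, gating mechanism, attention mechanism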

Abstract:

The use of contextual information plays an important role in speech enhancement tasks. To address the under-utilization of global speech information, a Gated Dilated Convolutional Recurrent Network (GDCRN) for complex spectral mapping was proposed. GDCRN was composed of an encoder, a Gated Temporal Convolution Module (GTCM) and a decoder. The encoder and decoder had asymmetric network structures. Firstly, features were processed by the encoder using a Gated Dilated Convolution Module (GDCM), which expanded the receptive field. Secondly, longer contextual information was captured and selectively passed on by the GTCM. Finally, deconvolution combined with a Gated Linear Unit (GLU) was used by the decoder, with each deconvolution layer connected to the corresponding convolution layer in the encoder through a skip connection. Additionally, a Channel Time-Frequency Attention (CTFA) mechanism was introduced. Experimental results show that the proposed network has fewer parameters and shorter training time than other networks such as Temporal Convolutional Neural Network (TCNN) and Gated Convolutional Recurrent Network (GCRN). The proposed GDCRN significantly improves Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI), by up to 0.2589 and 4.67 percentage points respectively, demonstrating that the proposed network has better enhancement effect and stronger generalization ability.

Key words: speech enhancement, complex spectral mapping, dilated convolution, gating mechanism, attention mechanism
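
The abstract names two generic building blocks, dilated convolution (to enlarge the receptive field) and GLU-style gating (to pass features selectively). The following is a minimal pure-Python sketch of how these two ideas combine on a 1-D sequence. It is not the paper's GDCM or GTCM: the causal zero padding, the single-channel layout, and all function names (`dilated_conv1d`, `gated_dilated_conv1d`, `receptive_field`) are assumptions made for illustration only.

```python
import math


def dilated_conv1d(x, w, dilation):
    """Causal 1-D convolution of sequence x with kernel w (illustrative).

    Kernel taps are spaced `dilation` samples apart; positions before the
    start of x are treated as zeros (an assumed padding choice).
    """
    k = len(w)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i in range(k):
            idx = t - (k - 1 - i) * dilation  # tap i looks this far back
            if idx >= 0:
                acc += w[i] * x[idx]
        out.append(acc)
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_dilated_conv1d(x, w_feat, w_gate, dilation):
    """GLU-style gating: feature branch multiplied by sigmoid(gate branch)."""
    feat = dilated_conv1d(x, w_feat, dilation)
    gate = dilated_conv1d(x, w_gate, dilation)
    return [f * sigmoid(g) for f, g in zip(feat, gate)]


def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    print(gated_dilated_conv1d(x, [1.0, 1.0], [0.0, 0.0], dilation=2))
    print(receptive_field(3, [1, 2, 4, 8]))
```

With a kernel size of 3, four stacked layers with dilations 1, 2, 4 and 8 see 31 samples, whereas four undilated layers would see only 9; this is the sense in which dilation enlarges the receptive field without adding parameters.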

CLC number: