Double complex convolution and attention aggregating recurrent network for speech enhancement

Journal of Computer Applications, 2023, Vol. 43, Issue (10): 3217-3224. DOI: 10.11772/j.issn.1001-9081.2022101533

• Multimedia computing and computer simulation •

Bennian YU¹, Yongzhao ZHAN¹, Qirong MAO¹˒², Wenlong DONG¹, Honglin LIU¹

1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang Jiangsu 212013, China
2. Jiangsu Province Big Data Ubiquitous Perception and Intelligent Agriculture Application Engineering Research Center, Zhenjiang Jiangsu 212013, China

Received: 2022-10-12  Revised: 2022-12-24  Accepted: 2022-12-28  Online: 2023-10-07  Published: 2023-10-10
Contact: Yongzhao ZHAN
About author: YU Bennian, born in 1996, native of Chizhou, Anhui, M. S. candidate. Her research interests include speech enhancement.
MAO Qirong, born in 1975, native of Luzhou, Sichuan, Ph. D., professor, CCF member. Her research interests include pattern recognition and multimedia analysis.
DONG Wenlong, born in 1997, native of Xuzhou, Jiangsu, Ph. D. candidate. His research interests include multimedia computing.
LIU Honglin, born in 1992, native of Suqian, Jiangsu, Ph. D. candidate. His research interests include image classification of pests and diseases.
Supported by:
Key Research and Development Program of Jiangsu Province (BE2020036)

Abstract

To address the limited representation of correlation information among spectrogram features and the unsatisfactory denoising performance of existing speech enhancement methods, a speech enhancement method based on a Double Complex Convolution and Attention Aggregating Recurrent Network (DCCARN) was proposed. Firstly, a double complex convolutional network was established to encode, in two branches, the spectrogram features obtained by the short-time Fourier transform. Secondly, the codes in the two branches were passed through inter-feature-block and intra-feature-block attention mechanisms respectively, so that different speech feature information was re-labeled. Thirdly, the long-term sequence information was processed by a Long Short-Term Memory (LSTM) network, and the spectrogram features were restored by two decoders and then aggregated. Finally, the target speech waveform was generated by the inverse short-time Fourier transform, thereby suppressing noise. Experiments were carried out on the public dataset VBD (Voice Bank+DEMAND) and on the TIMIT dataset with added noise. The results show that, compared with the phase-aware Deep Complex Convolution Recurrent Network (DCCRN), DCCARN increases the Perceptual Evaluation of Speech Quality (PESQ) score by 0.150 on VBD and by 0.077 to 0.087 on TIMIT. This verifies that the proposed method captures the correlation information of spectrogram features more accurately, suppresses noise more effectively, and improves speech intelligibility.

Key words: speech enhancement, attention mechanism, complex convolutional network, coding, Long Short-Term Memory (LSTM) network
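
The abstract does not spell out how a complex convolution is computed, so the following is only a minimal sketch. It assumes the usual complex-multiplication rule used by complex convolutional networks such as DCCRN [7], in which one complex convolution is assembled from four real convolutions over the real and imaginary parts of the spectrogram. The function names `conv1d_real` and `conv1d_complex` and the toy values are hypothetical, and the 1-D case stands in for the 2-D time-frequency case.

```python
def conv1d_real(x, w):
    """Valid-mode 1-D cross-correlation of two real sequences."""
    n = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n)]


def conv1d_complex(x_re, x_im, w_re, w_im):
    """Complex convolution assembled from four real convolutions.

    (W_re + jW_im) * (X_re + jX_im)
        = (W_re*X_re - W_im*X_im) + j(W_re*X_im + W_im*X_re)
    """
    rr = conv1d_real(x_re, w_re)  # real kernel on real input
    ii = conv1d_real(x_im, w_im)  # imaginary kernel on imaginary input
    ri = conv1d_real(x_im, w_re)  # real kernel on imaginary input
    ir = conv1d_real(x_re, w_im)  # imaginary kernel on real input
    out_re = [a - b for a, b in zip(rr, ii)]
    out_im = [a + b for a, b in zip(ri, ir)]
    return out_re, out_im


# Toy values (assumed): real and imaginary parts of four STFT bins.
x_re, x_im = [1.0, 2.0, 0.5, -1.0], [0.0, 1.0, -0.5, 2.0]
# Toy complex kernel with two taps (assumed).
w_re, w_im = [0.5, -1.0], [1.0, 0.25]

out_re, out_im = conv1d_complex(x_re, x_im, w_re, w_im)

# Reference result computed with Python's built-in complex numbers.
x = [complex(a, b) for a, b in zip(x_re, x_im)]
w = [complex(a, b) for a, b in zip(w_re, w_im)]
ref = [sum(x[i + k] * w[k] for k in range(len(w)))
       for i in range(len(x) - len(w) + 1)]
```

Keeping the real and imaginary parts as separate real-valued channels is what allows such a layer to be built from ordinary real convolutions while still modelling both magnitude and phase.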

CLC Number:

TN912.34

Bennian YU, Yongzhao ZHAN, Qirong MAO, Wenlong DONG, Honglin LIU. Double complex convolution and attention aggregating recurrent network for speech enhancement[J]. Journal of Computer Applications, 2023, 43(10): 3217-3224.

Figures/Tables 8

Fig. 1 DCCARN method architecture

Fig. 2 Time-frequency decomposition

Fig. 3 Feature attention mechanisms

Fig. 4 Value analysis of a and b on VBD dataset

Tab. 1 Speech evaluation scores of different methods on VBD dataset

Method	Input signal	PESQ	CSIG	CBAK	COVL
NOISY		1.970	3.350	2.440	2.630
Wavenet [17]		2.220	3.620	3.230	2.980
SEGAN [12]	Time domain	2.160	3.480	2.940	2.800
CNN-GAN [18]	Frequency domain	2.340	3.550	2.950	2.920
Wave-U-Net [19]	Time domain	2.400	3.520	3.240	2.960
MMSE-GAN [20]	Frequency domain	2.530	3.800	3.120	3.140
CRN [3]	Frequency domain	2.610	3.780	3.260	3.170
MDPHD [21]	Frequency domain	2.700	3.850	3.390	3.270
DCCRN [7]	Frequency domain	2.680	3.730	2.460	3.190
TFT-Net [22]	Frequency domain	2.750	3.930	3.440	3.340
PGGAN [4]	Frequency domain	2.810	3.990	3.590	3.360
FTSC-GAN [23]	Time domain	2.750	4.030	3.350	3.390
DCCARN	Frequency domain	2.830	3.910	3.600	3.430

Tab. 2 Speech evaluation scores of different methods on TIMIT dataset

Method	SNR/dB	PESQ	CSIG	CBAK	COVL
NOISY	5	2.208
	0	1.934
	-5	1.626
DCCRN	5	2.881	4.054	2.781	3.515
	0	2.448	3.624	2.546	3.047
	-5	2.040	3.152	2.304	2.575
DCCARN	5	2.958	4.160	3.016	3.607
	0	2.535	3.763	2.762	3.162
	-5	2.119	3.317	2.490	2.700
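
The PESQ gains quoted in the abstract can be checked directly against the DCCRN and DCCARN rows of Tab. 1 and Tab. 2. The short sketch below only repeats that subtraction; the variable names are illustrative, and the scores are copied from the tables.

```python
# PESQ scores copied from Tab. 1 (VBD) and Tab. 2 (TIMIT, keyed by SNR in dB).
vbd = {"DCCRN": 2.680, "DCCARN": 2.830}
timit = {
    5: {"DCCRN": 2.881, "DCCARN": 2.958},
    0: {"DCCRN": 2.448, "DCCARN": 2.535},
    -5: {"DCCRN": 2.040, "DCCARN": 2.119},
}

# Gain of DCCARN over DCCRN, rounded to the three decimals used in the tables.
vbd_gain = round(vbd["DCCARN"] - vbd["DCCRN"], 3)
timit_gain = {
    snr: round(scores["DCCARN"] - scores["DCCRN"], 3)
    for snr, scores in timit.items()
}

print(f"VBD PESQ gain: {vbd_gain:.3f}")
for snr, gain in timit_gain.items():
    print(f"TIMIT PESQ gain at {snr} dB: {gain:.3f}")
print(f"TIMIT range: {min(timit_gain.values()):.3f} to {max(timit_gain.values()):.3f}")
```

The differences are 0.150 on VBD and 0.077, 0.087 and 0.079 on TIMIT at 5, 0 and -5 dB, which matches the range of 0.077 to 0.087 reported in the abstract.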

Tab. 3 Ablation experimental results on VBD dataset

Baseline model	Inter-feature-block attention module	Intra-feature-block attention module	Dual-branch structure	$Loss$ loss function	PESQ	CSIG	CBAK	COVL
√					2.680	3.730	2.460	3.190
√	√				2.700	3.690	2.500	3.170
√		√			2.710	3.650	2.970	3.200
√			√		2.740	3.680	2.930	3.290
√	√	√	√		2.750	3.890	3.330	3.400
√	√	√	√	√	2.830	3.910	3.600	3.430

Fig. 5 Speech quality comparison

References 23

1	CHOI H S, KIM J H, HUH J, et al. Phase-aware speech enhancement with deep complex U-Net [EB/OL]. (2023-08-06) [2023-08-08].
2	HASANNEZHAD M, YU H, ZHU W P, et al. PACDNN: a phase-aware composite deep neural network for speech enhancement [J]. Speech Communication, 2022, 136: 1-13. DOI: 10.1016/j.specom.2021.10.002
3	TAN K, WANG D. A convolutional recurrent neural network for real-time speech enhancement [C]// Proceedings of the INTERSPEECH 2018. [S.l.]: International Speech Communication Association, 2018: 3229-3233. DOI: 10.21437/interspeech.2018-1405
4	LI Y, SUN M, ZHANG X. Perception-guided generative adversarial network for end-to-end speech enhancement [J]. Applied Soft Computing, 2022, 128: No.109446. DOI: 10.1016/j.asoc.2022.109446
5	WANG Z, ZHANG T, SHAO Y, et al. LSTM-convolutional-BLSTM encoder-decoder network for minimum mean-square error approach to speech enhancement [J]. Applied Acoustics, 2021, 172: No.107647. DOI: 10.1016/j.apacoust.2020.107647
6	YU G, WANG Y, ZHENG C, et al. CycleGAN-based non-parallel speech enhancement with an adaptive attention-in-attention mechanism [C]// Proceedings of the 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2021: 523-529.
7	HU Y, LIU Y, LV S, et al. DCCRN: deep complex convolution recurrent network for phase-aware speech enhancement [C]// Proceedings of the INTERSPEECH 2020. [S.l.]: International Speech Communication Association, 2020: 2472-2476. DOI: 10.21437/interspeech.2020-2537
8	WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module [C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11211. Cham: Springer, 2018: 3-19.
9	KOIZUMI Y, YATABE K, DELCROIX M, et al. Speech enhancement using self-adaptation and multi-head self-attention [C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2020: 181-185. DOI: 10.1109/icassp40776.2020.9053214
10	ZHANG Q, SONG Q, NI Z, et al. Time-frequency attention for monaural speech enhancement [C]// Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2022: 7852-7856. DOI: 10.1109/icassp43922.2022.9746454
11	GAO G, WANG X, ZENG B, et al. Speech enhancement algorithm based on time-frequency joint loss function [J]. Journal of Computer Applications, 2022, 42(S1): 316-320.
12	PASCUAL S, BONAFONTE A, SERRÀ J. SEGAN: speech enhancement generative adversarial network [C]// Proceedings of the INTERSPEECH 2017. [S.l.]: International Speech Communication Association, 2017: 3642-3646. DOI: 10.21437/interspeech.2017-1428
13	VEAUX C, YAMAGISHI J, KING S. The voice bank corpus: design, collection and data analysis of a large regional accent speech database [C]// Proceedings of the 2013 International Conference Oriental COCOSDA Held Jointly with Conference on Asian Spoken Language Research and Evaluation. Piscataway: IEEE, 2013: 1-4. DOI: 10.1109/icsda.2013.6709856
14	THIEMANN J, ITO N, VINCENT E. The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): a database of multichannel environmental noise recordings [J]. Proceedings of Meetings on Acoustics, 2013, 19(1): No.035081. DOI: 10.1121/1.4806631
15	GAROFOLO J S, LAMEL L F, FISHER W M. TIMIT acoustic phonetic continuous speech corpus [DS/OL]. [2022-12-15]. DOI: 10.6028/nist.ir.4930
16	VARGA A, STEENEKEN H J M. Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems [J]. Speech Communication, 1993, 12(3): 247-251. DOI: 10.1016/0167-6393(93)90095-3
17	RETHAGE D, PONS J, SERRA X. A Wavenet for speech denoising [C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 5069-5073. DOI: 10.1109/icassp.2018.8462417
18	SHAH N, PATIL H A, SONI M H. Time-frequency mask-based speech enhancement using convolutional generative adversarial network [C]// Proceedings of the 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2018: 1246-1251. DOI: 10.23919/apsipa.2018.8659692
19	MACARTNEY C, WEYDE T. Improved speech enhancement with the Wave-U-Net [EB/OL]. (2018-11-27) [2022-12-15].
20	SONI M H, SHAH N, PATIL H A. Time-frequency masking-based speech enhancement using generative adversarial network [C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 5039-5043. DOI: 10.1109/icassp.2018.8462068
21	KIM J H, YOO J, CHUN S, et al. Multi-domain processing via hybrid denoising networks for speech enhancement [EB/OL]. (2018-12-21) [2022-12-15]. DOI: 10.48550/arXiv.1812.08914
22	TANG C, LUO C, ZHAO Z, et al. Joint time-frequency and time domain learning for speech enhancement [C]// Proceedings of the 29th International Joint Conferences on Artificial Intelligence. California: ijcai.org, 2020: 3816-3822. DOI: 10.24963/ijcai.2020/528
23	SHEN M Q, YU W N, YI L, et al. Full-time scale speech enhancement method based on GAN [J]. Computer Engineering, 2023, 49(6): 115-122, 130.
