Double complex convolution and attention aggregating recurrent network for speech enhancement

Journal of Computer Applications, 2023, Vol. 43, Issue (10): 3217-3224. DOI: 10.11772/j.issn.1001-9081.2022101533

Topic: Multimedia computing and computer simulation

Bennian YU¹, Yongzhao ZHAN¹, Qirong MAO¹,², Wenlong DONG¹, Honglin LIU¹

1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China
2. Jiangsu Province Big Data Ubiquitous Perception and Intelligent Agriculture Application Engineering Research Center, Zhenjiang, Jiangsu 212013, China

Received: 2022-10-12 Revised: 2022-12-24 Accepted: 2022-12-28 Online: 2023-10-07 Published: 2023-10-10
Contact: Yongzhao ZHAN
About the authors: YU Bennian, born in 1996 in Chizhou, Anhui, M.S. candidate. Her research interests include speech enhancement.
MAO Qirong, born in 1975 in Luzhou, Sichuan, Ph.D., professor, CCF member. Her research interests include pattern recognition and multimedia analysis.
DONG Wenlong, born in 1997 in Xuzhou, Jiangsu, Ph.D. candidate. His research interests include multimedia computing.
LIU Honglin, born in 1992 in Suqian, Jiangsu, Ph.D. candidate. His research interests include image classification of pests and diseases.
Supported by:
Key Research and Development Program of Jiangsu Province (BE2020036)

Abstract:

To address the limited representation of spectrogram feature correlation information and the unsatisfactory denoising performance of existing speech enhancement methods, a speech enhancement method based on a Double Complex Convolution and Attention Aggregating Recurrent Network (DCCARN) was proposed. Firstly, a double complex convolutional network was established to encode the spectrogram features obtained by the short-time Fourier transform in two branches. Secondly, inter-feature-block and intra-feature-block attention mechanisms were applied to the codes of the two branches respectively, re-labeling the different speech feature information. Thirdly, the long-term sequence information was processed by a Long Short-Term Memory (LSTM) network, and the spectrogram features were restored by two decoders and then aggregated. Finally, the target speech waveform was generated by the inverse short-time Fourier transform to suppress noise. Experiments were carried out on the public dataset VBD (Voice Bank+DEMAND) and on the noise-added TIMIT dataset. The results show that, compared with the phase-aware Deep Complex Convolution Recurrent Network (DCCRN), DCCARN improves the Perceptual Evaluation of Speech Quality (PESQ) score by 0.150 on VBD and by 0.077 to 0.087 on TIMIT. This verifies that the proposed method captures the correlation information of spectrogram features more accurately, suppresses noise more effectively, and improves speech intelligibility.

Key words: speech enhancement, attention mechanism, complex convolutional network, coding, Long Short-Term Memory (LSTM) network
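The complex convolution that the abstract applies to spectrogram features can be illustrated with a minimal sketch. It assumes the usual decomposition of a complex convolution into four real convolutions, which follows directly from complex multiplication; the 1-D shape, the kernel values, and the function names `real_conv1d` and `complex_conv1d` are hypothetical and are not taken from the paper, whose layers operate on 2-D spectrograms with learned kernels.

```python
def real_conv1d(x, w):
    """Valid-mode 1-D cross-correlation of two real sequences."""
    n_out = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n_out)]


def complex_conv1d(x, w):
    """Complex cross-correlation (no conjugation) built from four real ones.

    (x_r + j*x_i)(w_r + j*w_i) = (x_r*w_r - x_i*w_i) + j*(x_r*w_i + x_i*w_r)
    """
    xr, xi = [v.real for v in x], [v.imag for v in x]
    wr, wi = [v.real for v in w], [v.imag for v in w]
    rr, ii = real_conv1d(xr, wr), real_conv1d(xi, wi)
    ri, ir = real_conv1d(xr, wi), real_conv1d(xi, wr)
    return [complex(a - b, c + d) for a, b, c, d in zip(rr, ii, ri, ir)]


def direct_complex_conv1d(x, w):
    """Reference result using Python's built-in complex arithmetic."""
    n_out = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n_out)]


# Illustrative values only: one row of a complex spectrogram and a 2-tap kernel.
spectrum_row = [1 + 2j, 0.5 - 1j, -1 + 0.25j, 2 + 0j]
kernel = [0.5 + 0.5j, -1 + 0.25j]

out = complex_conv1d(spectrum_row, kernel)
ref = direct_complex_conv1d(spectrum_row, kernel)
print(out)
```

Keeping the real and imaginary parts as separate real-valued channels is what lets such a layer be built from ordinary real convolutions while still processing magnitude and phase jointly.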

CLC number: TN912.34

Bennian YU, Yongzhao ZHAN, Qirong MAO, Wenlong DONG, Honglin LIU. Double complex convolution and attention aggregating recurrent network for speech enhancement[J]. Journal of Computer Applications, 2023, 43(10): 3217-3224.

Figures/Tables: 8

Fig. 1 DCCARN method architecture

Fig. 2 Time-frequency decomposition

Fig. 3 Feature attention mechanisms

Fig. 4 Value analysis of a and b on VBD dataset

Tab. 1 Speech evaluation scores of different methods on VBD dataset

Method	Input signal	PESQ	CSIG	CBAK	COVL
NOISY		1.970	3.350	2.440	2.630
Wavenet [17]		2.220	3.620	3.230	2.980
SEGAN [12]	Time domain	2.160	3.480	2.940	2.800
CNN-GAN [18]	Frequency domain	2.340	3.550	2.950	2.920
Wave-U-Net [19]	Time domain	2.400	3.520	3.240	2.960
MMSE-GAN [20]	Frequency domain	2.530	3.800	3.120	3.140
CRN [3]	Frequency domain	2.610	3.780	3.260	3.170
MDPHD [21]	Frequency domain	2.700	3.850	3.390	3.270
DCCRN [7]	Frequency domain	2.680	3.730	2.460	3.190
TFT-Net [22]	Frequency domain	2.750	3.930	3.440	3.340
PGGAN [4]	Frequency domain	2.810	3.990	3.590	3.360
FTSC-GAN [23]	Time domain	2.750	4.030	3.350	3.390
DCCARN	Frequency domain	2.830	3.910	3.600	3.430

Tab. 2 Speech evaluation scores of different methods on TIMIT dataset

Method	SNR/dB	PESQ	CSIG	CBAK	COVL
NOISY	5	2.208
	0	1.934
	-5	1.626
DCCRN	5	2.881	4.054	2.781	3.515
	0	2.448	3.624	2.546	3.047
	-5	2.040	3.152	2.304	2.575
DCCARN	5	2.958	4.160	3.016	3.607
	0	2.535	3.763	2.762	3.162
	-5	2.119	3.317	2.490	2.700
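The PESQ gains quoted in the abstract can be re-derived from the DCCRN and DCCARN rows of Tab. 1 and Tab. 2. The following is a small arithmetic check, not part of the original article; the dictionary layout and variable names are illustrative, while the scores are copied from the two tables.

```python
# PESQ scores copied from Tab. 1 (VBD) and Tab. 2 (TIMIT at three SNR levels).
pesq = {
    "VBD": {"DCCRN": 2.680, "DCCARN": 2.830},
    "TIMIT 5 dB": {"DCCRN": 2.881, "DCCARN": 2.958},
    "TIMIT 0 dB": {"DCCRN": 2.448, "DCCARN": 2.535},
    "TIMIT -5 dB": {"DCCRN": 2.040, "DCCARN": 2.119},
}

# Gain of DCCARN over DCCRN for each condition, rounded to three decimals.
gains = {
    condition: round(scores["DCCARN"] - scores["DCCRN"], 3)
    for condition, scores in pesq.items()
}

timit_gains = [g for condition, g in gains.items() if condition.startswith("TIMIT")]

for condition, gain in gains.items():
    print(f"{condition:12s} PESQ gain: {gain:+.3f}")
print(f"TIMIT range: {min(timit_gains):.3f} to {max(timit_gains):.3f}")
```

The gain on VBD comes out at 0.150, and the TIMIT gains span 0.077 (5 dB) to 0.087 (0 dB), with the -5 dB condition at 0.079 in between, which matches the figures reported in the abstract.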

Tab. 3 Ablation experimental results on VBD dataset

Baseline model	Inter-feature-block attention module	Intra-feature-block attention module	Dual-branch structure	$Loss$ loss function	PESQ	CSIG	CBAK	COVL
√					2.680	3.730	2.460	3.190
√	√				2.700	3.690	2.500	3.170
√		√			2.710	3.650	2.970	3.200
√			√		2.740	3.680	2.930	3.290
√	√	√	√		2.750	3.890	3.330	3.400
√	√	√	√	√	2.830	3.910	3.600	3.430
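To read the ablation rows more easily, the change of each configuration relative to the baseline can be tabulated. This is a hedged sketch for reading Tab. 3 only; the configuration labels and the helper `deltas_vs_baseline` are hypothetical names, and the scores are copied from the table.

```python
METRICS = ("PESQ", "CSIG", "CBAK", "COVL")

# Rows of Tab. 3, in order; labels are shorthand for the ticked columns.
ABLATION = {
    "baseline": (2.680, 3.730, 2.460, 3.190),
    "+ inter-block attention": (2.700, 3.690, 2.500, 3.170),
    "+ intra-block attention": (2.710, 3.650, 2.970, 3.200),
    "+ dual-branch": (2.740, 3.680, 2.930, 3.290),
    "+ all three modules": (2.750, 3.890, 3.330, 3.400),
    "+ all three modules + Loss": (2.830, 3.910, 3.600, 3.430),
}


def deltas_vs_baseline(table, baseline="baseline"):
    """Return per-metric differences of every configuration from the baseline."""
    base = table[baseline]
    return {
        name: {m: round(v - b, 3) for m, v, b in zip(METRICS, scores, base)}
        for name, scores in table.items()
        if name != baseline
    }


deltas = deltas_vs_baseline(ABLATION)
for name, d in deltas.items():
    print(f"{name:28s}", "  ".join(f"{m} {d[m]:+.3f}" for m in METRICS))
```

Read this way, each single module raises PESQ slightly while CSIG dips below the baseline, and only the combined configurations improve all four metrics at once.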

Fig. 5 Speech quality comparison

References: 23

1	CHOI H S, KIM J H, HUH J, et al. Phase-aware speech enhancement with deep complex U-Net[EB/OL]. (2023-08-06)[2023-08-08].
2	HASANNEZHAD M, YU H, ZHU W P, et al. PACDNN: a phase-aware composite deep neural network for speech enhancement[J]. Speech Communication, 2022, 136: 1-13. DOI: 10.1016/j.specom.2021.10.002
3	TAN K, WANG D. A convolutional recurrent neural network for real-time speech enhancement[C]// Proceedings of the INTERSPEECH 2018. [S.l.]: International Speech Communication Association, 2018: 3229-3233. DOI: 10.21437/interspeech.2018-1405
4	LI Y, SUN M, ZHANG X. Perception-guided generative adversarial network for end-to-end speech enhancement[J]. Applied Soft Computing, 2022, 128: No.109446. DOI: 10.1016/j.asoc.2022.109446
5	WANG Z, ZHANG T, SHAO Y, et al. LSTM-convolutional-BLSTM encoder-decoder network for minimum mean-square error approach to speech enhancement[J]. Applied Acoustics, 2021, 172: No.107647. DOI: 10.1016/j.apacoust.2020.107647
6	YU G, WANG Y, ZHENG C, et al. CycleGAN-based non-parallel speech enhancement with an adaptive attention-in-attention mechanism[C]// Proceedings of the 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2021: 523-529.
7	HU Y, LIU Y, LV S, et al. DCCRN: deep complex convolution recurrent network for phase-aware speech enhancement[C]// Proceedings of the INTERSPEECH 2020. [S.l.]: International Speech Communication Association, 2020: 2472-2476. DOI: 10.21437/interspeech.2020-2537
8	WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11211. Cham: Springer, 2018: 3-19.
9	KOIZUMI Y, YATABE K, DELCROIX M, et al. Speech enhancement using self-adaptation and multi-head self-attention[C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2020: 181-185. DOI: 10.1109/icassp40776.2020.9053214
10	ZHANG Q, SONG Q, NI Z, et al. Time-frequency attention for monaural speech enhancement[C]// Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2022: 7852-7856. DOI: 10.1109/icassp43922.2022.9746454
11	GAO G, WANG X, ZENG B, et al. Speech enhancement algorithm based on time-frequency joint loss function[J]. Journal of Computer Applications, 2022, 42(S1): 316-320.
12	PASCUAL S, BONAFONTE A, SERRÀ J. SEGAN: speech enhancement generative adversarial network[C]// Proceedings of the INTERSPEECH 2017. [S.l.]: International Speech Communication Association, 2017: 3642-3646. DOI: 10.21437/interspeech.2017-1428
13	VEAUX C, YAMAGISHI J, KING S. The voice bank corpus: design, collection and data analysis of a large regional accent speech database[C]// Proceedings of the 2013 International Conference Oriental COCOSDA Held Jointly with Conference on Asian Spoken Language Research and Evaluation. Piscataway: IEEE, 2013: 1-4. DOI: 10.1109/icsda.2013.6709856
14	THIEMANN J, ITO N, VINCENT E. The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): a database of multichannel environmental noise recordings[J]. Proceedings of Meetings on Acoustics, 2013, 19(1): No.035081. DOI: 10.1121/1.4806631
15	GAROFOLO J S, LAMEL L F, FISHER W M. TIMIT acoustic phonetic continuous speech corpus[DS/OL]. [2022-12-15]. DOI: 10.6028/nist.ir.4930
16	VARGA A, STEENEKEN H J M. Assessment for automatic speech recognition: Ⅱ. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems[J]. Speech Communication, 1993, 12(3): 247-251. DOI: 10.1016/0167-6393(93)90095-3
17	RETHAGE D, PONS J, SERRA X. A Wavenet for speech denoising[C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 5069-5073. DOI: 10.1109/icassp.2018.8462417
18	SHAH N, PATIL H A, SONI M H. Time-frequency mask-based speech enhancement using convolutional generative adversarial network[C]// Proceedings of the 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2018: 1246-1251. DOI: 10.23919/apsipa.2018.8659692
19	MACARTNEY C, WEYDE T. Improved speech enhancement with the Wave-U-Net[EB/OL]. (2018-11-27)[2022-12-15].
20	SONI M H, SHAH N, PATIL H A. Time-frequency masking-based speech enhancement using generative adversarial network[C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 5039-5043. DOI: 10.1109/icassp.2018.8462068
21	KIM J H, YOO J, CHUN S, et al. Multi-domain processing via hybrid denoising networks for speech enhancement[EB/OL]. (2018-12-21)[2022-12-15]. DOI: 10.48550/arXiv.1812.08914
22	TANG C, LUO C, ZHAO Z, et al. Joint time-frequency and time domain learning for speech enhancement[C]// Proceedings of the 29th International Joint Conferences on Artificial Intelligence. California: ijcai.org, 2020: 3816-3822. DOI: 10.24963/ijcai.2020/528
23	SHEN M Q, YU W N, YI L, et al. Full-time scale speech enhancement method based on GAN[J]. Computer Engineering, 2023, 49(6): 115-122, 130.
