基于听觉调制孪生网络的单通道语音分离模型

doi:10.11772/j.issn.1001-9081.2024050724

《计算机应用》唯一官方网站 ›› 2025, Vol. 45 ›› Issue (6): 2025-2033.DOI: 10.11772/j.issn.1001-9081.2024050724

• 多媒体计算与计算机仿真 • 上一篇

基于听觉调制孪生网络的单通道语音分离模型

宋源¹, 陈锌¹, 李亚荣¹, 李永伟², 刘扬¹, 赵振¹

^1.青岛科技大学信息科学技术学院，山东青岛 266061
^2.中国科学院自动化研究所模式识别国家重点实验室，北京 100190

收稿日期:2024-06-03 修回日期:2025-01-08 接受日期:2025-01-10 发布日期:2025-01-14 出版日期:2025-06-10
通讯作者: 刘扬
作者简介:宋源（2000—），女，河南驻马店人，硕士研究生，主要研究方向：语音分离
陈锌（2000—），男，湖南长沙人，硕士研究生，主要研究方向：语音分离、语音情感识别
李亚荣（2000—），女，山东聊城人，硕士研究生，主要研究方向：语音情感识别
李永伟（1988—），男，北京人，助理研究员，博士，主要研究方向：语音情感识别
刘扬（1988—），男，山东临沂人，副教授，博士，主要研究方向：语音信号处理 yangliu@qust.edu.cn
赵振（1982—），男，山东德州人，副教授，博士，主要研究方向：智能控制、信息处理。
基金资助:
国家自然科学基金青年项目(62201314);山东省自然科学基金青年项目(ZR2020QF007)

Single-channel speech separation model based on auditory modulation Siamese network

Yuan SONG¹, Xin CHEN¹, Yarong LI¹, Yongwei LI², Yang LIU¹, Zhen ZHAO¹

^1.School of Information Science and Technology，Qingdao University of Science and Technology，Qingdao Shandong 266061，China
^2.National Laboratory of Pattern Recognition，Institute of Automation，Chinese Academy of Sciences，Beijing 100190，China

Received:2024-06-03 Revised:2025-01-08 Accepted:2025-01-10 Online:2025-01-14 Published:2025-06-10
Contact: Yang LIU
About author:SONG Yuan， born in 2000， M. S. candidate. Her research interests include speech separation.
CHEN Xin， born in 2000， M. S. candidate. His research interests include speech separation， speech emotion recognition.
LI Yarong， born in 2000， M. S. candidate. Her research interests include speech emotion recognition.
LI Yongwei， born in 1988， Ph. D.， assistant research fellow. His research interests include speech emotion recognition.
LIU Yang， born in 1988， Ph. D.， associate professor. His research interests include speech signal processing.
ZHAO Zhen， born in 1982， Ph. D.， associate professor. His research interests include intelligent control， information processing.
Supported by:
Youth Program of National Natural Science Foundation of China(62201314);Youth Program of Shandong Provincial Natural Science Foundation(ZR2020QF007)

摘要/Abstract

摘要：

为了解决基于语谱图特征输入的单通道语音分离方法存在的不同说话人时频点重叠导致分离效果欠佳的问题，提出一种基于听觉调制孪生网络的单通道语音分离模型。首先，通过频带划分和包络检波计算调制信号，进而利用傅里叶变换提取调制幅度谱；其次，基于突变点检测和匹配的方法获取调制幅度谱特征与语音片段之间的映射关系，实现语音片段的有效划分；再次，设计融合协同注意力机制的孪生神经网络（FCMSNN）提取不同说话人语音片段的鉴别性特征；继次，提出基于邻域机制的自组织映射（N-SOM）网络，通过划定动态邻域范围实现无需预先指定说话人数量的特征聚类，以获得不同说话人的掩膜矩阵；最后，为了避免在调制域重构信号中产生伪影，设计时域滤波器将调制域掩膜转换为时域掩膜并结合相位信息重构语音信号。实验结果表明，所提模型在WSJ0-2mix和WSJ0-3mix数据集上的语音质量感知评价（PESQ）、信号失真比改进（SDRi）和尺度不变信号失真比改进（SI-SDRi）均优于双密度双树复小波变换（DDDTCWT）方法，分别提高了3.47%、6.91%和7.79%和3.08%、6.71%、7.51%。

关键词: 语音分离, 调制机制, 孪生网络, 协同注意力机制, 自组织映射网络

Abstract:

To address the problem of overlapping time-frequency points among different speakers leading to poor separation performance in single-channel speech separation methods based on spectrogram feature input， a single-channel speech separation model based on auditory modulation Siamese network was proposed. Firstly， the modulation signals were computed through frequency band division and envelope demodulation， and the modulation amplitude spectrum was extracted using Fourier transform. Secondly， mapping relationship between the modulation amplitude spectrum features and speech segments was obtained using a mutation point detection and matching method to achieve effective segmentation of speech segments. Thirdly， a Fusion of Co-attention Mechanisms in Siamese Neural Network （FCMSNN） was designed to extract discriminative features of speech segments of different speakers. Fourthly， a Neighborhood-based Self-Organizing Map （N-SOM） network was proposed to perform feature clustering without pre-specifying the number of speakers by defining a dynamic neighborhood range， so as to obtain mask matrices for different speakers. Finally， to avoid artifacts in the reconstructed signals in the modulation domain， a time-domain filter was designed to convert modulation-domain masks into time-domain masks and reconstruct speech signals by combining phase information. The experimental results show that the proposed model outperforms the Double-Density Dual-Tree Complex Wavelet Transform （DDDTCWT） method in terms of Perceptual Evaluation of Speech Quality （PESQ）， Signal-to-Distortion Ratio improvement （SDRi） and Scale-Invariant Signal-to-Distortion Ratio improvement （SI-SDRi）； on WSJ0-2mix and WSJ0-3mix datasets the proposed model has PESQ， SDRi， and SI-SDRi improved by 3.47%， 6.91% and 7.79% and 3.08%， 6.71% and 7.51% respectively.

Key words: speech separation, modulation mechanism, Siamese network, co-attention mechanism, Self-Organizing Map (SOM) network

中图分类号:

TP399

宋源, 陈锌, 李亚荣, 李永伟, 刘扬, 赵振. 基于听觉调制孪生网络的单通道语音分离模型[J]. 计算机应用, 2025, 45(6): 2025-2033.

Yuan SONG, Xin CHEN, Yarong LI, Yongwei LI, Yang LIU, Zhen ZHAO. Single-channel speech separation model based on auditory modulation Siamese network[J]. Journal of Computer Applications, 2025, 45(6): 2025-2033.

图/表 9

参考文献 35

1	SHAMMA S A. Speech processing in the auditory system Ⅱ： lateral inhibition and the central processing of speech evoked activity in the auditory nerve［J］. The Journal of the Acoustical Society of America， 1985， 78（5）： 1622-1632.
2	ROUAT J. Computational auditory scene analysis： principles， algorithms， and applications （Wang， D. and Brown， G.J.， Eds.； 2006）［Book review］［J］. IEEE Transactions on Neural Networks， 2008， 19（1）： 199-199.
3	ZEREMDINI J， MESSAOUD M A BEN， BOUZID A. A comparison of several Computational Auditory Scene Analysis （CASA） techniques for monaural speech segregation［J］. Brain Informatics， 2015， 2（3）： 155-166.
4	SHAO Y， SRINIVASAN S， JIN Z， et al. A computational auditory scene analysis system for speech segregation and robust speech recognition［J］. Computer Speech and Language， 2010， 24（1）： 77-93.
5	CARDOSO J F. Source separation using higher order moments［C］// Proceedings of the 1989 International Conference on Acoustics， Speech， and Signal Processing — Volume 4. Piscataway： IEEE， 1989： 2109-2112.
6	LEE D D， SEUNG H S. Learning the parts of objects by non-negative matrix factorization［J］. Nature， 1999， 401（6755）： 788-791.
7	CHOI S， CICHOCKI A， PARK H M， et al. Blind source separation and independent component analysis： a review［J］. Neural Information Processing — Letters and Reviews， 2005， 6（1）： 1-57.
8	STONE J V. Independent component analysis： a tutorial introduction［M］. Cambridge： MIT Press， 2004.
9	THARWAT A. Independent component analysis： an introduction［J］. Applied Computing and Informatics， 2021， 17（2）： 222-249.
10	GANNOT S， VINCENT E， MARKOVICH-GOLAN S， et al. A consolidated perspective on multimicrophone speech enhancement and source separation［J］. IEEE/ACM Transactions on Audio， Speech， and Language Processing， 2017， 25（4）： 692-730.
11	EPHRAIM Y， MALAH D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator［J］. IEEE Transactions on Acoustics， Speech， and Signal Processing， 1984， 32（6）： 1109-1121.
12	SLOCK D T M. A survey of convergence results on the least mean square algorithm［J］. IEEE Transactions on Signal Processing， 1993， 41（12）： 2960-2988.
13	BENESTY J， CHEN J， HUANG Y， et al. Pearson correlation coefficient-based adaptive noise cancellation algorithms for improved speech recognition［J］. IEEE Transactions on Audio， Speech， and Language Processing， 2018， 26（5）： 984-998.
14	MINIPRIYA T M， RAJAVEL R. Review of ideal binary and ratio mask estimation techniques for monaural speech separation［C］// Proceedings of the 4th International Conference on Advances in Electrical， Electronics， Information， Communication and Bio-Informatics. Piscataway： IEEE， 2018： 1-5.
15	KOLBÆK M， YU D， TAN Z H， et al. Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks［J］. IEEE/ACM Transactions on Audio， Speech， and Language Processing， 2017， 25（10）： 1901-1913.
16	KOLBÆK M， YU D， TAN Z H， et al. Joint separation and denoising of noisy multi-talker speech using recurrent neural networks and permutation invariant training［C］// Proceedings of the IEEE 27th International Workshop on Machine Learning for Signal Processing. Piscataway： IEEE， 2017： 1-6.
17	YU D， KOLBÆK M， TAN Z H， et al. Permutation invariant training of deep models for speaker-independent multi-talker speech separation［C］// Proceedings of the 2017 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2017： 241-245.
18	XU C， RAO W， XIAO X， et al. Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM［C］// Proceedings of the 2018 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2018： 6-10.
19	MIN E， GUO X， LIU Q， et al. A survey of clustering with deep learning： from the perspective of network architecture［J］. IEEE Access， 2018， 6： 39501-39514.
20	CHEN Z， LUO Y， MESGARANI N. Deep attractor network for single-microphone speaker separation［C］// Proceedings of the 2017 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2017： 246-250.
21	LUO Y， CHEN Z， MESGARANI N. Speaker-independent speech separation with deep attractor network［J］. IEEE/ACM Transactions on Audio， Speech， and Language Processing， 2018， 26（4）： 787-796.
22	关文博. 基于孪生神经网络与深度聚类融合的语音分离研究［D］. 青岛：青岛科技大学， 2023.
	GUAN W B. Research on speech separation based on fusion of Siamese neural network and deep clustering［D］. Qingdao： Qingdao University of Science and Technology， 2023.
23	KAMBLE M R， TAK H， PATIL H A. Amplitude and frequency modulation-based features for detection of replay spoof speech［J］. Speech Communication， 2020， 125： 114-127.
24	MSONDA P， UYMAZ S A， KARAAĞAÇ S S. Spatial pyramid pooling in deep convolutional networks for automatic tuberculosis diagnosis［J］. Traitement du Signal， 2020， 37（6）： 1075-1084.
25	QIAN C， GAO J， SHAO X， et al. Bearing fault diagnosis method based on improved deep residual Siamese neural network［J］. Insight — Non-Destructive Testing and Condition Monitoring， 2024， 66（3）： 174-181.
26	STÖTER F R， LIUTKUS A， ITO N. The 2018 signal separation evaluation campaign［C］// Proceedings of the 2018 International Conference on Latent Variable Analysis and Signal Separation LNCS 10891. Cham： Springer， 2018： 293-305.
27	葛宛营. 基于非负矩阵分解和深度聚类的语音分离研究［D］. 重庆：重庆邮电大学， 2020.
	GE W Y. Study on speech separation based on non-negative matrix factorization and deep clustering［D］. Chongqing： Chongqing University of Posts and Telecommunications， 2020.
28	LUO Y， MESGARANI N. Conv-TasNet： surpassing ideal time-frequency magnitude masking for speech separation［J］. IEEE/ACM Transactions on Audio， Speech， and Language Processing， 2019， 27（8）： 1256-1266.
29	WANG Z Q， LE ROUX J， HERSHEY J R. Alternative objective functions for deep clustering［C］// Proceedings of the 2018 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2018： 686-690.
30	LUO Y， CHEN Z， HERSHEY J R， et al. Deep clustering and conventional networks for music separation： stronger together［C］// Proceedings of the 2017 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2017： 61-65.
31	STOLLER D， EWERT S， DIXON S. Wave-U-Net： a multi-scale neural network for end-to-end audio source separation［C］// Proceedings of the 19th International Society for Music Information Retrieval Conference. ［S.l.］： International Society for Music Information Retrieval， 2018： 334-340.
32	LAN C， JIANG J， ZHANG L， et al. Blind source separation based on improved Wave-U-Net network［J］. IEEE Access， 2023， 11： 125951-125958.
33	YEN Y F， SHUAI H H. Group-and-conquer for multi-speaker single-channel speech separation［C］// Proceedings of the 33rd Wireless and Optical Communications Conference. Piscataway： IEEE， 2024： 165-169.
34	HOSSAIN M I， RAHIM M A， HOSSAIN M N. Single-channel speech separation based on double-density dual-tree CWT and SNMF［J］. Annals of Emerging Technologies in Computing， 2024， 8（1）： 1-12.
35	FU J， XIE Q， MENG D， et al. Rotation equivariant proximal operator for deep unfolding methods in image restoration［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2024， 46（10）： 6577-6593.

数据集	语音分离模型	PESQ	SDRi/dB	SI-SDRi/dB
WSJ0-2mix	NMF-LSTM	2.08	6.34	—
	uPIT-LSTM	2.11	7.00	7.43
	TasNet-LSTM	2.84	8.00	8.20
	DPCL	3.05	7.93	5.80
	Chimera	3.21	8.35	7.00
	Wave-U-Net	3.32	8.67	9.02
	Wave-U-Net+SAM+ASPP	3.42	9.16	9.25
	Group+Conquer	3.44	9.23	9.21
	DDDTCWT	3.46	9.25	9.24
	本文模型	3.58	9.89	9.96
WSJ0-3mix	NMF-LSTM	1.93	4.31	—
	uPIT-LSTM	2.01	5.05	—
	TasNet-LSTM	2.53	5.26	5.64
	DPCL	2.95	5.93	6.08
	Chimera	3.02	6.17	6.05
	Wave-U-Net	3.10	6.23	7.21
	Wave-U-Net+SAM+ASPP	3.24	6.30	7.45
	Group+Conquer	3.20	6.22	7.43
	DDDTCWT	3.25	6.41	7.46
	本文模型	3.35	6.84	8.02

数据集	语音分离模型	PESQ	SDRi/dB	SI-SDRi/dB
WSJ0-2mix	NMF-LSTM	2.08	6.34	—
	uPIT-LSTM	2.11	7.00	7.43
	TasNet-LSTM	2.84	8.00	8.20
	DPCL	3.05	7.93	5.80
	Chimera	3.21	8.35	7.00
	Wave-U-Net	3.32	8.67	9.02
	Wave-U-Net+SAM+ASPP	3.42	9.16	9.25
	Group+Conquer	3.44	9.23	9.21
	DDDTCWT	3.46	9.25	9.24
	本文模型	3.58	9.89	9.96
WSJ0-3mix	NMF-LSTM	1.93	4.31	—
	uPIT-LSTM	2.01	5.05	—
	TasNet-LSTM	2.53	5.26	5.64
	DPCL	2.95	5.93	6.08
	Chimera	3.02	6.17	6.05
	Wave-U-Net	3.10	6.23	7.21
	Wave-U-Net+SAM+ASPP	3.24	6.30	7.45
	Group+Conquer	3.20	6.22	7.43
	DDDTCWT	3.25	6.41	7.46
	本文模型	3.35	6.84	8.02

语音分离模型	MOS
语音分离模型	S1	S2	S3	S4
NMF-LSTM	3.41	3.53	3.32	3.40
uPIT-LSTM	3.32	3.53	3.41	3.32
TasNet-LSTM	3.40	3.44	3.30	3.52
DPCL	3.20	3.66	3.51	3.41
Chimera	3.63	3.81	3.70	3.83
Wave-U-Net	3.82	3.73	3.61	3.94
Wave-U-Net+SAM+ASPP	3.91	4.02	3.92	3.83
Group+Conquer	3.90	3.98	3.84	4.00
DDDTCWT	3.93	4.00	4.03	3.90
本文模型	4.18	4.25	4.33	4.05

语音分离模型	MOS
语音分离模型	S1	S2	S3	S4
NMF-LSTM	3.41	3.53	3.32	3.40
uPIT-LSTM	3.32	3.53	3.41	3.32
TasNet-LSTM	3.40	3.44	3.30	3.52
DPCL	3.20	3.66	3.51	3.41
Chimera	3.63	3.81	3.70	3.83
Wave-U-Net	3.82	3.73	3.61	3.94
Wave-U-Net+SAM+ASPP	3.91	4.02	3.92	3.83
Group+Conquer	3.90	3.98	3.84	4.00
DDDTCWT	3.93	4.00	4.03	3.90
本文模型	4.18	4.25	4.33	4.05

消融实验组	PESQ	SDRi/dB	SI-SDRi/dB
A	3.58	9.89	9.96
B	3.51	9.62	9.73
C	2.98	8.14	8.33
D	3.46	9.56	9.54
E	3.06	8.82	8.75
F	3.47	9.25	8.91
G	3.34	8.80	8.56

基于听觉调制孪生网络的单通道语音分离模型

Single-channel speech separation model based on auditory modulation Siamese network

RichHTML

PDF

可视化

摘要/Abstract

引用本文

使用本文

图/表 9

参考文献 35

相关文章 15

编辑推荐

Metrics

[1]	熊武, 曹从军, 宋雪芳, 邵云龙, 王旭升. 基于多尺度混合域注意力机制的笔迹鉴别方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2225-2232.
[2]	孙子文, 钱立志, 杨传栋, 高一博, 陆庆阳, 袁广林. 基于Transformer的视觉目标跟踪方法综述[J]. 《计算机应用》唯一官方网站, 2024, 44(5): 1644-1654.
[3]	荆智文, 张屿佳, 孙伯廷, 郭浩. 二阶段孪生图卷积神经网络推荐算法[J]. 《计算机应用》唯一官方网站, 2024, 44(2): 469-476.
[4]	崔晨辉, 蔺素珍, 李大威, 禄晓飞, 武杰. 基于孪生网络和Transformer的红外弱小目标跟踪方法[J]. 《计算机应用》唯一官方网站, 2024, 44(2): 563-571.
[5]	姜钧舰, 刘达维, 刘逸凡, 任酉贵, 赵志滨. 基于孪生网络的小样本目标检测算法[J]. 《计算机应用》唯一官方网站, 2023, 43(8): 2325-2329.
[6]	赵元龙, 单玉刚, 袁杰, 赵康迪. 基于实例分割与毕达哥拉斯模糊决策的目标跟踪[J]. 《计算机应用》唯一官方网站, 2023, 43(6): 1930-1937.
[7]	王梦亭, 杨文忠, 武雍智. 基于孪生网络的单目标跟踪算法综述[J]. 《计算机应用》唯一官方网站, 2023, 43(3): 661-673.
[8]	王奇, 雷航, 王旭鹏. 姿态干扰下的深度人脸验证[J]. 《计算机应用》唯一官方网站, 2023, 43(2): 595-600.
[9]	公海涛, 陈志华, 盛斌, 祝冰艳. 基于孪生网络和Transformer的小目标跟踪算法SiamTrans[J]. 《计算机应用》唯一官方网站, 2023, 43(12): 3733-3739.
[10]	庄屹, 赵海涛. 面向三维点云单目标跟踪的提案聚合网络[J]. 《计算机应用》唯一官方网站, 2022, 42(5): 1407-1416.
[11]	朱文球, 邹广, 曾志高. 融合层次特征和混合注意力的目标跟踪算法[J]. 《计算机应用》唯一官方网站, 2022, 42(3): 833-843.
[12]	姬张建, 任兴旺. 带旋转与尺度估计的全卷积孪生网络目标跟踪算法[J]. 计算机应用, 2021, 41(9): 2705-2711.
[13]	钟莎, 黄玉清. 基于孪生区域候选网络的无人机指定目标跟踪[J]. 计算机应用, 2021, 41(2): 523-529.
[14]	楼豪杰, 郑元林, 廖开阳, 雷浩, 李佳. 基于Siamese-YOLOv4的印刷品缺陷目标检测[J]. 《计算机应用》唯一官方网站, 2021, 41(11): 3206-3212.
[15]	李自强, 王正勇, 陈洪刚, 李林怡, 何小海. 基于外观和动作特征双预测模型的视频异常行为检测[J]. 计算机应用, 2021, 41(10): 2997-3003.