Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (6): 2025-2033.DOI: 10.11772/j.issn.1001-9081.2024050724

• Multimedia computing and computer simulation •

Single-channel speech separation model based on auditory modulation Siamese network

Yuan SONG1, Xin CHEN1, Yarong LI1, Yongwei LI2, Yang LIU1, Zhen ZHAO1   

  1. School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, Shandong 266061, China
  2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2024-06-03 Revised: 2025-01-08 Accepted: 2025-01-10 Online: 2025-01-14 Published: 2025-06-10
  • Contact: Yang LIU (yangliu@qust.edu.cn)
  • About author: SONG Yuan, born in 2000, M.S. candidate. Her research interests include speech separation.
    CHEN Xin, born in 2000, M.S. candidate. His research interests include speech separation and speech emotion recognition.
    LI Yarong, born in 2000, M.S. candidate. Her research interests include speech emotion recognition.
    LI Yongwei, born in 1988, Ph.D., assistant research fellow. His research interests include speech emotion recognition.
    LIU Yang, born in 1988, Ph.D., associate professor. His research interests include speech signal processing.
    ZHAO Zhen, born in 1982, Ph.D., associate professor. His research interests include intelligent control and information processing.
  • Supported by:
    Youth Program of National Natural Science Foundation of China (62201314); Youth Program of Shandong Provincial Natural Science Foundation (ZR2020QF007)


Abstract:

To address the problem that time-frequency points of different speakers overlap in spectrogram features, which degrades the performance of spectrogram-based single-channel speech separation methods, a single-channel speech separation model based on an auditory modulation Siamese network was proposed. Firstly, modulation signals were computed through frequency band division and envelope demodulation, and the modulation amplitude spectrum was extracted using the Fourier transform. Secondly, the mapping relationship between modulation amplitude spectrum features and speech segments was obtained using a mutation point detection and matching method, achieving effective segmentation of speech segments. Thirdly, a Fusion of Co-attention Mechanisms in Siamese Neural Network (FCMSNN) was designed to extract discriminative features of the speech segments of different speakers. Fourthly, a Neighborhood-based Self-Organizing Map (N-SOM) network was proposed to perform feature clustering without pre-specifying the number of speakers, by defining a dynamic neighborhood range, so as to obtain mask matrices for the different speakers. Finally, to avoid artifacts in signals reconstructed in the modulation domain, a time-domain filter was designed to convert the modulation-domain masks into time-domain masks, and the speech signals were reconstructed by combining phase information. The experimental results show that the proposed model outperforms the Double-Density Dual-Tree Complex Wavelet Transform (DDDTCWT) method in terms of Perceptual Evaluation of Speech Quality (PESQ), Signal-to-Distortion Ratio improvement (SDRi) and Scale-Invariant Signal-to-Distortion Ratio improvement (SI-SDRi): on the WSJ0-2mix and WSJ0-3mix datasets, the proposed model improves PESQ, SDRi and SI-SDRi by 3.47%, 6.91% and 7.79%, and by 3.08%, 6.71% and 7.51%, respectively.
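The first step of the pipeline (frequency band division, envelope demodulation, and a Fourier transform of the envelopes) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the band edges, filter order, and frame length below are assumed parameters chosen only for the example.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_amplitude_spectrum(x, fs, bands=((100, 1000), (1000, 3500)),
                                  frame_len=256):
    """Per-band envelope demodulation followed by an FFT of the envelope,
    giving one modulation amplitude spectrum per acoustic frequency band."""
    spectra = []
    for lo, hi in bands:
        # Band division: 4th-order Butterworth band-pass filter
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        sub = sosfiltfilt(sos, x)
        # Envelope demodulation via the analytic (Hilbert) signal
        env = np.abs(hilbert(sub))
        # Frame the envelope and take the magnitude spectrum of each frame
        frames = env[: len(env) // frame_len * frame_len].reshape(-1, frame_len)
        spectra.append(np.abs(np.fft.rfft(frames, axis=1)))
    return np.stack(spectra)  # shape: (n_bands, n_frames, frame_len // 2 + 1)
```

In a full system, these per-band modulation amplitude spectra would be the features fed to the subsequent segmentation and Siamese-network stages.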
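The final reconstruction step (masking the mixture magnitude and reusing the mixture phase) can be illustrated generically. The sketch below operates on an ordinary STFT and deliberately omits the paper's modulation-domain-to-time-domain mask conversion; it shows only the standard masked-magnitude-plus-mixture-phase resynthesis that the abstract's last step builds on.

```python
import numpy as np
from scipy.signal import stft, istft

def reconstruct_with_mixture_phase(mix, mask, fs, nperseg=512):
    """Apply a real-valued time-frequency mask to the mixture STFT magnitude,
    reuse the mixture phase, and invert by overlap-add."""
    _, _, Z = stft(mix, fs=fs, nperseg=nperseg)
    # Masked magnitude combined with the mixture's phase
    est = mask * np.abs(Z) * np.exp(1j * np.angle(Z))
    _, y = istft(est, fs=fs, nperseg=nperseg)
    return y
```

With an all-ones mask this round-trips the mixture, which is a convenient sanity check before plugging in estimated speaker masks.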

Key words: speech separation, modulation mechanism, Siamese network, co-attention mechanism, Self-Organizing Map (SOM) network

