Journal of Computer Applications (《计算机应用》), sole official website


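Single-Channel Speech Separation Algorithm Based on Auditory Modulation Siamese Network

Song Yuan (宋源)1, Chen Xin (陈锌)1, Li Yarong (李亚荣)2, Li Yongwei (李永伟)3, Liu Yang (刘扬)1, Zhao Zhen (赵振)1

  1. Qingdao University of Science and Technology
  2. School of Information Science and Technology, Qingdao University of Science and Technology
  3. Institute of Automation, Chinese Academy of Sciences
  • Received: 2024-05-31 Revised: 2025-01-08 Online: 2025-01-14 Published: 2025-01-14
  • Corresponding author: Liu Yang (刘扬)
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China; Youth Program of the Shandong Provincial Natural Science Foundation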

Abstract: Single-channel speech separation methods that take spectrogram features as input separate poorly when the time-frequency points of different speakers overlap. To address this problem, a single-channel speech separation model based on an auditory modulation Siamese network is proposed. First, modulation signals are computed through frequency-band division and envelope detection, and the modulation amplitude spectrum is extracted with the Fourier transform. Second, change-point detection and matching are used to obtain the mapping between modulation amplitude spectrum features and speech segments, so that the speech segments can be partitioned effectively. Third, a Siamese network based on a co-attention mechanism is designed to extract discriminative features from the speech segments of different speakers. Fourth, a self-organizing map network based on a neighborhood influence mechanism (N-SOM) is proposed; by delimiting a dynamic neighborhood range, it clusters the features without requiring the number of speakers to be specified in advance and yields a mask matrix for each speaker. Finally, to avoid the artifacts produced when signals are reconstructed in the modulation domain, a time-domain filter is designed to convert the modulation-domain masks into time-domain masks, which are combined with phase information to reconstruct the speech signals. Experimental results show that, on the WSJ0-2mix and WSJ0-3mix datasets, the proposed model outperforms the double-density dual-tree complex wavelet transform (DDDTCWT) method in perceptual evaluation of speech quality (PESQ), signal-to-distortion ratio improvement (SDRi), and scale-invariant signal-to-distortion ratio improvement (SI-SDRi).
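The first stage of the abstract (frequency-band division, envelope detection, then a Fourier transform of the envelope) can be illustrated with a minimal standard-library sketch. The abstract does not specify the filter bank or the envelope detector, so everything here is an assumption chosen for illustration: a single second-order band-pass filter stands in for one band, rectification plus a one-pole low-pass stands in for the detector, and the function names `bandpass_biquad`, `envelope`, and `modulation_amplitude_spectrum` are hypothetical.

```python
import cmath
import math


def bandpass_biquad(x, fs, f0, q=4.0):
    """One band of a filter bank: second-order band-pass (assumed design)."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0          # b1 is zero for this filter
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    y = []
    for xn in x:
        yn = b0 * xn + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def envelope(x, fs, cutoff_hz=50.0):
    """Envelope detection: full-wave rectification plus one-pole low-pass."""
    k = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    state, env = 0.0, []
    for xn in x:
        state += k * (abs(xn) - state)
        env.append(state)
    return env


def modulation_amplitude_spectrum(env, fs, mod_freqs):
    """Magnitude of the Fourier transform of the mean-removed envelope."""
    n = len(env)
    mean = sum(env) / n
    spectrum = []
    for f in mod_freqs:
        acc = sum((e - mean) * cmath.exp(-2j * math.pi * f * i / fs)
                  for i, e in enumerate(env))
        spectrum.append(abs(acc) / n)
    return spectrum


# Synthetic test signal: a 500 Hz carrier amplitude-modulated at 8 Hz.
fs = 4000
t = [i / fs for i in range(fs)]  # one second of samples
signal = [(1.0 + 0.5 * math.cos(2 * math.pi * 8 * ti))
          * math.sin(2 * math.pi * 500 * ti) for ti in t]

band = bandpass_biquad(signal, fs, f0=500.0)
env = envelope(band, fs)
mod_freqs = list(range(1, 21))   # modulation frequencies 1..20 Hz
spec = modulation_amplitude_spectrum(env, fs, mod_freqs)
peak_hz = mod_freqs[spec.index(max(spec))]
print("strongest modulation frequency (Hz):", peak_hz)
```

A full front end would repeat this for every band of the filter bank and for successive frames. The sketch only shows how the modulation rate of a band surfaces as a peak in the spectrum of its envelope.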

Key words: speech separation, modulation mechanism, Siamese network, co-attention mechanism, self-organizing map (SOM)
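For readers unfamiliar with the SI-SDRi metric named in the abstract, the sketch below computes it from its standard definition: the estimate is projected onto the reference, the residual is treated as distortion, and the improvement is the gain over the unprocessed mixture. The function names `si_sdr` and `si_sdr_improvement` and the synthetic signals are illustrative assumptions, not taken from the paper's evaluation code.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB."""
    n = len(reference)
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [e - est_mean for e in estimate]    # remove DC offsets
    ref = [r - ref_mean for r in reference]
    # Project the estimate onto the reference to get the scaled target.
    scale = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    target = [scale * r for r in ref]
    noise = [e - s for e, s in zip(est, target)]
    target_energy = sum(s * s for s in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10(target_energy / noise_energy)


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the separated estimate over the raw mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Synthetic example: two orthogonal sinusoids play the two "speakers".
n = 1000
reference = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
interferer = [math.sin(2 * math.pi * 9 * i / n) for i in range(n)]
mixture = [r + v for r, v in zip(reference, interferer)]
# An estimate that keeps 10% of the interferer and is rescaled by 2.
estimate = [2.0 * (r + 0.1 * v) for r, v in zip(reference, interferer)]

improvement = si_sdr_improvement(estimate, mixture, reference)
print("SI-SDRi (dB):", round(improvement, 3))
```

Because the target is obtained by projection, rescaling the estimate leaves the score unchanged, which is why the factor of 2 in the example has no effect on the result.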

CLC number: