To address the problem that overlapping time-frequency points among different speakers lead to poor separation performance in single-channel speech separation methods that take spectrogram features as input, a single-channel speech separation model based on an auditory modulation Siamese network was proposed. Firstly, modulation signals were computed through frequency band division and envelope demodulation, and the modulation amplitude spectrum was extracted using the Fourier transform. Secondly, the mapping relationship between modulation amplitude spectrum features and speech segments was obtained using a mutation point (change point) detection and matching method, so as to achieve effective segmentation of speech segments. Thirdly, a Siamese neural network fusing co-attention mechanisms, namely the Fusion of Co-attention Mechanisms in Siamese Neural Network (FCMSNN), was designed to extract discriminative features of the speech segments of different speakers. Fourthly, a Neighborhood-based Self-Organizing Map (N-SOM) network was proposed to cluster these features without pre-specifying the number of speakers by defining a dynamic neighborhood range, thereby obtaining mask matrices for the different speakers. Finally, to avoid artifacts in signals reconstructed in the modulation domain, a time-domain filter was designed to convert the modulation-domain masks into time-domain masks, and the speech signals were reconstructed by combining phase information. Experimental results show that the proposed model outperforms the Double-Density Dual-Tree Complex Wavelet Transform (DDDTCWT) method in terms of Perceptual Evaluation of Speech Quality (PESQ), Signal-to-Distortion Ratio improvement (SDRi) and Scale-Invariant Signal-to-Distortion Ratio improvement (SI-SDRi): on the WSJ0-2mix and WSJ0-3mix datasets, the proposed model improves PESQ, SDRi and SI-SDRi by 3.47%, 6.91% and 7.79%, and by 3.08%, 6.71% and 7.51%, respectively.
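The modulation-domain front end described above (frequency band division, envelope demodulation, and a Fourier transform of the envelope) can be illustrated with the minimal sketch below. It is not the paper's implementation: the band edges, filter order, frame length, hop size, and the function name modulation_amplitude_spectrum are assumptions introduced here for demonstration only.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_amplitude_spectrum(x, fs,
                                  bands=((100, 300), (300, 700), (700, 1500), (1500, 3000)),
                                  frame_len=0.256, hop=0.064):
    """Sketch of a modulation amplitude spectrum front end.

    For each acoustic band: band-pass filter the signal (frequency band
    division), take the Hilbert envelope (envelope demodulation), then apply
    a framed Fourier transform to the envelope to obtain modulation-frequency
    amplitudes. All parameter values here are illustrative assumptions.
    """
    n_frame = int(frame_len * fs)
    n_hop = int(hop * fs)
    window = np.hanning(n_frame)
    spectra = []
    for lo, hi in bands:
        # Frequency band division: 4th-order Butterworth band-pass filter.
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        # Envelope demodulation via the analytic signal.
        env = np.abs(hilbert(band))
        # Framed Fourier transform of the envelope -> modulation amplitude spectrum.
        frames = []
        for start in range(0, len(env) - n_frame + 1, n_hop):
            seg = env[start:start + n_frame] * window
            frames.append(np.abs(np.fft.rfft(seg)))
        spectra.append(np.stack(frames))      # (n_frames, n_mod_bins)
    return np.stack(spectra)                  # (n_bands, n_frames, n_mod_bins)
```

For a mixture sampled at, say, 8 kHz, the result is a tensor of per-band modulation amplitude spectra on which the subsequent segmentation, FCMSNN feature extraction, and N-SOM clustering stages would operate.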