To address the low accuracy and poor generalization ability of existing Speech Emotion Recognition (SER) models, a hybrid Siamese Multi-scale CNN-BiGRU network was proposed. In this network, a Multi-Scale Feature Extractor (MSFE) and a Multi-Dimensional Attention (MDA) module were introduced to construct a Siamese architecture, and the effective training data were enlarged by constructing sample pairs, thereby improving recognition accuracy and enabling the model to better adapt to complex real-world application scenarios. Experimental results on the public IEMOCAP and EMO-DB datasets show that the proposed model improves recognition accuracy by 8.28 and 7.79 percentage points, respectively, over the baseline CNN-BiGRU model. Furthermore, a customer service speech emotion dataset was constructed from recordings of real customer service conversations. Experimental results on this dataset show that the proposed model reaches a recognition accuracy of 87.85%, indicating that it has good generalization ability.
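The data-enlarging effect of pair-based Siamese training can be illustrated with a minimal sketch: pairing N labeled utterances yields up to N·(N−1)/2 training pairs, each tagged with whether the two utterances share an emotion label. The function and variable names below are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations
import random

def make_pairs(samples, labels, seed=0):
    """Build (index_a, index_b, same_label) pairs for Siamese training.
    N samples yield N*(N-1)/2 candidate pairs, which is how pair-based
    training enlarges the effective training set."""
    rng = random.Random(seed)
    pairs = [(i, j, int(labels[i] == labels[j]))
             for i, j in combinations(range(len(samples)), 2)]
    rng.shuffle(pairs)  # mix same- and different-emotion pairs
    return pairs

# Hypothetical example: 6 utterances covering 3 emotion classes
feats = [f"utt{i}" for i in range(6)]
emos = ["angry", "happy", "angry", "sad", "happy", "sad"]
pairs = make_pairs(feats, emos)
print(len(pairs))  # 15 pairs from only 6 samples
```

In practice the pair list is usually balanced or subsampled so that same-emotion and different-emotion pairs appear in comparable proportions.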