To address the problems that a single type of speech emotion feature represents speech information incompletely and that models make low use of speech features, a multi-feature fusion speech emotion recognition method based on an SAA-CNN-BiLSTM network was proposed. The method augmented the data by applying noise injection, volume scaling, and audio rate perturbation, enabling the model to learn diverse data characteristics, and fused multiple features, including fundamental frequency, time-domain, and frequency-domain features, to represent emotional information comprehensively from different perspectives. In addition, a Convolutional Neural Network (CNN) was introduced in front of the Bidirectional Long Short-Term Memory (BiLSTM) network to capture the spatial correlation of the input data and extract more representative features. Meanwhile, a Simplified Additive Attention (SAA) mechanism was constructed that removes the explicit query keys and query vectors, so that the attention weights are computed without depending on specific query information. Through these attention weights, features of different dimensions could be correlated and influence one another, allowing information to be exchanged and fused across features and thereby improving their effective utilization. Experimental results show that the proposed method achieves weighted precisions of 87.02%, 82.59%, and 73.13% on the EMO-DB, CASIA, and SAVEE datasets, respectively. Compared with baseline methods such as Incremental Convolution (IncConv), Novel Heterogeneous Parallel Convolution BiLSTM (NHPC-BiLSTM), and Dynamic Convolutional Recurrent Neural Network (DCRNN), the improvements are 0.52-9.80, 2.92-23.09, and 3.13-16.63 percentage points, respectively.
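The three augmentations named above (noise, volume, and audio rate) can be illustrated with a minimal NumPy sketch; the specific parameter values (SNR, gain, rate) are illustrative assumptions, not the settings used in the paper.

```python
# Sketch of the three waveform augmentations described in the abstract,
# assuming the input is a 1-D floating-point NumPy array in [-1, 1].
import numpy as np

def add_noise(wave: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Add white Gaussian noise at a target signal-to-noise ratio (dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def scale_volume(wave: np.ndarray, gain: float = 1.2) -> np.ndarray:
    """Scale the amplitude, then clip back to the valid [-1, 1] range."""
    return np.clip(wave * gain, -1.0, 1.0)

def change_speed(wave: np.ndarray, rate: float = 1.1) -> np.ndarray:
    """Change the audio rate by linear resampling (rate > 1 speeds up)."""
    old_idx = np.arange(len(wave))
    new_idx = np.linspace(0, len(wave) - 1, int(len(wave) / rate))
    return np.interp(new_idx, old_idx, wave)

if __name__ == "__main__":
    sr = 16000
    wave = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)  # 1 s test tone
    augmented = [add_noise(wave), scale_volume(wave), change_speed(wave)]
    print([a.shape for a in augmented])
```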
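The multi-feature fusion step can likewise be sketched with librosa: a fundamental frequency track, time-domain descriptors, and frequency-domain features are extracted frame by frame and stacked into one matrix. The exact feature set, F0 search range, and hop length here are assumptions for illustration, not the paper's configuration.

```python
# Sketch of frame-level multi-feature fusion: F0 (fundamental frequency),
# zero-crossing rate and RMS energy (time domain), and MFCCs (frequency
# domain) are concatenated into one fused feature vector per frame.
import librosa
import numpy as np

def fused_features(y: np.ndarray, sr: int) -> np.ndarray:
    hop = 512
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr, hop_length=hop)
    zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop)[0]
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    # Trim all streams to a common frame count before stacking.
    T = min(len(f0), len(zcr), len(rms), mfcc.shape[1])
    fused = np.vstack([f0[None, :T], zcr[None, :T], rms[None, :T], mfcc[:, :T]])
    return fused.T  # shape: (frames, 16), one fused vector per frame
```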
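Finally, a minimal PyTorch sketch of the overall architecture: a 1-D CNN over the fused frame sequence, a BiLSTM, and a query-free additive attention layer consistent with the abstract's description of SAA, in which scores are computed from the features alone with no separate query or key projections. Layer sizes, kernel widths, and the pooling step are illustrative assumptions rather than the paper's exact SAA-CNN-BiLSTM configuration.

```python
import torch
import torch.nn as nn

class SimplifiedAdditiveAttention(nn.Module):
    """Additive attention without explicit query keys or query vectors:
    e_t = v^T tanh(W h_t + b), alpha = softmax(e), context = sum(alpha_t h_t)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)             # W h_t + b
        self.score = nn.Linear(dim, 1, bias=False)  # v^T tanh(.)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, dim), e.g. BiLSTM outputs
        e = self.score(torch.tanh(self.proj(h)))    # (batch, time, 1)
        alpha = torch.softmax(e, dim=1)             # weights over time
        return (alpha * h).sum(dim=1)               # weighted context vector

class SAACNNBiLSTM(nn.Module):
    def __init__(self, n_feats: int, n_classes: int, hidden: int = 128):
        super().__init__()
        # The CNN captures local spatial correlation in the fused features
        # before the BiLSTM models temporal dependencies in both directions.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_feats, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.bilstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.attn = SimplifiedAdditiveAttention(2 * hidden)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_feats) fused frame-level features
        z = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # (batch, time/2, 64)
        h, _ = self.bilstm(z)                            # (batch, time/2, 2*hidden)
        return self.fc(self.attn(h))                     # (batch, n_classes)

if __name__ == "__main__":
    model = SAACNNBiLSTM(n_feats=16, n_classes=7)
    logits = model(torch.randn(4, 100, 16))  # 4 utterances, 100 frames each
    print(logits.shape)  # torch.Size([4, 7])
```

Because the attention scores depend only on the hidden states themselves, each frame's weight is fixed once the features are computed, which is what allows features of different dimensions to interact through a single shared set of weights.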