Urban highway traffic flow prediction depends on both historical traffic flow and the traffic flow of neighboring lanes, and therefore involves complex spatio-temporal features. The traditional Convolutional Long Short-Term Memory (ConvLSTM) traffic flow prediction model does not separate spatial and temporal features, which leads to insufficient feature extraction, feature confusion, and loss of feature information. To address these problems, several improvements were made to the ConvLSTM model. Firstly, the short-term temporal features and the spatial features of the traffic flow data at each sampling moment were extracted and then fused along specific dimensions into short-term spatio-temporal features. Secondly, residual mapping was applied to the fused features. Finally, the mapped short-term spatio-temporal features were fed into a Transformer model to capture the long-term spatio-temporal features of the traffic flow data, from which the future traffic flow at each sampling point was predicted. On California urban freeway data, with Mean Absolute Error (MAE) as the evaluation metric, the proposed model improved prediction accuracy by 18% over the Conv-Transformer model, validating its effectiveness.
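For readers unfamiliar with the metric, the following is a minimal sketch of how MAE and a relative improvement between two models could be computed. It assumes that the reported improvement is the relative reduction in MAE with respect to the baseline, which the abstract does not state explicitly. The function names and all flow values are hypothetical and chosen only for illustration; they are not results from the study.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted flows."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(baseline_mae, proposed_mae):
    """Fraction by which the proposed model lowers MAE versus the baseline.

    Assumption: 'accuracy improved by X%' is read as a relative MAE reduction.
    """
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return (baseline_mae - proposed_mae) / baseline_mae


# Illustrative (made-up) vehicle counts at four sampling points.
observed = [120.0, 135.0, 150.0, 140.0]
baseline_pred = [110.0, 145.0, 140.0, 150.0]  # each prediction is off by 10
proposed_pred = [112.0, 143.0, 142.0, 148.0]  # each prediction is off by 8

baseline_mae = mean_absolute_error(observed, baseline_pred)
proposed_mae = mean_absolute_error(observed, proposed_pred)
improvement = relative_improvement(baseline_mae, proposed_mae)
```

Under this reading, a lower MAE means a more accurate model, so the improvement is positive whenever the proposed model's MAE is below that of the baseline.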