To address the inability of existing 3D shape reconstruction models to effectively fuse global spatio-temporal information, a Depth Focus Volume (DFV) module was proposed to retain the transition information between focus and defocus. On this basis, a Global Spatio-Temporal Feature Coupling (GSTFC) model was proposed to extract the local and global spatio-temporal features of multi-depth-of-field image sequences. Firstly, 3D-ConvNeXt modules and 3D convolutional layers were interleaved in the contracting path to capture multi-scale local spatio-temporal features. Meanwhile, a 3D-SwinTransformer module was added to the bottleneck to capture the global correlations among these local spatio-temporal features. Then, the local spatio-temporal features and global correlations were fused into global spatio-temporal features by an adaptive parameter layer, and the fused features were fed into the expanding path to guide the generation of the focus volume. Finally, DFV extracted the sequence weight information of the focus volume while retaining the focus-defocus transition information, yielding the final depth map. Experimental results show that GSTFC reduces the Root Mean Square Error (RMSE) by 12.5% on the FoD500 dataset compared with the state-of-the-art All-in-Focus Depth Net (AiFDepthNet) model, and retains more depth-of-field transition relationships than the traditional Robust Focus Volume Regularization in Shape from Focus (RFVR-SFF) model.
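
As an illustration of the final step, the sketch below shows one way a depth map can be regressed from a focus volume. The abstract does not describe the internals of DFV, so this is not the authors' module. It assumes the soft-argmax scheme common in depth-from-focus work: the per-pixel focus scores are normalized across the image sequence with a softmax, and the depth is the weighted sum of the per-frame focus distances. The function names and the nested-list layout `[frame][row][col]` are hypothetical choices made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_from_focus_volume(focus_volume, focus_distances):
    """Regress a depth map from a focus volume (assumed soft-argmax scheme).

    focus_volume:    nested list indexed [frame][row][col] of focus scores
    focus_distances: focus distance associated with each frame
    """
    n_frames = len(focus_volume)
    if n_frames != len(focus_distances):
        raise ValueError("one focus distance is required per frame")

    height = len(focus_volume[0])
    width = len(focus_volume[0][0])
    depth = [[0.0] * width for _ in range(height)]

    for r in range(height):
        for c in range(width):
            # Sequence weights for this pixel: how strongly each frame is in focus.
            weights = softmax([focus_volume[k][r][c] for k in range(n_frames)])
            # Expected focus distance under those weights.
            depth[r][c] = sum(w * d for w, d in zip(weights, focus_distances))
    return depth
```

Because the weights are soft rather than a hard argmax, a pixel whose sharpness is spread over neighbouring frames receives an intermediate depth, which is one simple way to keep the gradual transition between focus and defocus instead of snapping to a single frame.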