To address the problems that memory-based methods face in semi-supervised Video Object Segmentation (VOS), such as object occlusion caused by inter-object interactions and interference from similar objects or background noise, a semi-supervised VOS method based on spatio-temporal decoupling and regional robustness enhancement was proposed. Firstly, a structural Transformer architecture was employed to remove the feature information shared by all pixels, thereby emphasizing the differences among pixels and bringing out the key features of objects in video frames. Secondly, the similarity between the current frame and the long-term memory frames was decoupled into two dimensions: spatio-temporal correlation and object importance. This decoupling allowed a more precise analysis of pixel-level spatio-temporal and object features, and thereby addressed the object occlusion caused by inter-object interactions. Finally, a Regional Strip Attention (RSA) module was designed that uses the object location information in long-term memory to strengthen the focus on the foreground region and suppress background noise. Experimental results show that the proposed method outperforms the retrained AOT (Associating Objects with Transformers) model by 1.7 percentage points in J&F on the DAVIS 2017 validation set and by 1.6 percentage points in overall score on the YouTube-VOS 2019 validation set, indicating that the proposed method effectively addresses these challenges in semi-supervised VOS.
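
The idea of removing feature information shared by all pixels can be illustrated with a toy example. The abstract does not specify how the structural Transformer does this, so the sketch below assumes the simplest possible reading: subtracting the per-channel mean over all pixels. The function names `remove_shared_component` and `cosine` and the toy feature values are hypothetical and are not taken from the proposed method.

```python
import math


def remove_shared_component(features):
    """Subtract the per-channel mean over all pixels from every pixel feature.

    Assumption: mean-centering stands in for the 'shared feature information'
    removal; the actual architecture may do something quite different.
    """
    n = len(features)
    dims = len(features[0])
    mean = [sum(f[c] for f in features) / n for c in range(dims)]
    return [[f[c] - mean[c] for c in range(dims)] for f in features]


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Three toy pixel features that share a large common offset of [10, 10].
pixels = [[11.0, 10.0], [10.0, 11.0], [9.0, 9.0]]
centered = remove_shared_component(pixels)

# Similarity between the first two pixels before and after centering.
sim_before = cosine(pixels[0], pixels[1])
sim_after = cosine(centered[0], centered[1])
print(sim_before, sim_after)
```

In this toy case the first two pixels look almost identical before centering because the shared offset dominates both vectors; once it is removed, their remaining components are orthogonal, so a similarity-based matcher can tell them apart.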