Recent research on human pose estimation indicates that fully exploiting the latent spatial information of 2D poses to obtain representative features leads to more accurate 3D pose results. Therefore, a spatio-temporal context network based on the graph attention mechanism was proposed, which consists of a Temporal Context Network with Shifted windows (STCN), an Extremity-Guided global graph ATtention mechanism network (EGAT), and a Pose Grammar-based local graph attention Convolution Network (PGCN). Firstly, STCN was used to transform the 2D joint positions of a long sequence into latent human-pose features of a single sequence, which aggregated and exploited long-range and short-range pose information effectively while significantly reducing the computational cost. Secondly, EGAT was presented to compute the global spatial context efficiently: the human extremities were treated as “traffic hubs”, and bridges were established for information exchange between them and the other joint nodes. Thirdly, the graph attention mechanism was employed to assign adaptive weights when computing the global context over the human joints. Finally, PGCN was designed to compute and model the local spatial context with a Graph Convolution Network (GCN), thereby emphasizing the motion consistency of symmetric human joints and the motion correlation structure of the human skeleton. The proposed model was evaluated on two challenging benchmark datasets: Human3.6M and HumanEva-I. Experimental results demonstrate that the proposed model achieves superior performance: with an input length of 81 frames, it attains a Mean Per Joint Position Error (MPJPE) of 43.5 mm on the Human3.6M dataset, a 10.5% reduction compared with the state-of-the-art algorithm MCFNet (Multi-scale Cross Fusion Network), showing higher accuracy.
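To make the adaptive-weight idea behind the global graph attention concrete, the following is a minimal, illustrative sketch of a single-head graph-attention layer over human joints, in the style of a standard GAT. It is not the paper's EGAT implementation; the class name, joint count (17), and feature dimensions are assumptions chosen only for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointGraphAttention(nn.Module):
    """Single-head graph attention over human joints (illustrative sketch).

    Every joint attends to every other joint, and the attention weights are
    learned adaptively, as in a standard GAT layer. This is an assumption-based
    stand-in, not the paper's exact EGAT module.
    """
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)   # per-joint feature projection
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)    # scores a pair of joint features

    def forward(self, x):
        # x: (batch, num_joints, in_dim) latent features per joint
        h = self.proj(x)                                      # (B, J, D)
        B, J, D = h.shape
        hi = h.unsqueeze(2).expand(B, J, J, D)                # query joint i
        hj = h.unsqueeze(1).expand(B, J, J, D)                # key joint j
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1))  # (B, J, J)
        alpha = torch.softmax(e, dim=-1)                      # adaptive weight of joint j for joint i
        return alpha @ h                                      # weighted aggregation of global context

# Usage: a 17-joint skeleton with 64-dimensional latent features (hypothetical sizes)
x = torch.randn(2, 17, 64)
layer = JointGraphAttention(64, 64)
out = layer(x)   # (2, 17, 64)
```

Because the attention weights are computed from the joint features themselves rather than fixed by the skeleton topology, distant joints such as the extremities can exchange information directly, which is the behavior the abstract attributes to treating extremities as “traffic hubs”.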