To address background interference, keypoint localization deviation, and target occlusion in human pose estimation from Unmanned Aerial Vehicle (UAV) aerial views, an enhanced human pose estimation algorithm for non-ground-view scenarios, named YOLO-AirPose, was proposed. Firstly, a symmetric flip augmentation strategy constrained by keypoint topology, named IPSFA (Index-Preserved Symmetric Flip Augmentation), was designed to improve generalization in multi-view scenarios. Secondly, a C2BRA (C2 Bi-level Routing Attention) module was constructed by integrating the BRA (Bi-level Routing Attention) mechanism and was used to replace the original C2PSA (Cross stage Partial with Spatial Attention) module, thereby enhancing the model's perception of small-scale targets and occluded keypoints. Thirdly, to exploit the spatial modeling ability of the Transformer, an AIFI (Adaptive Interaction Feature Integration) module with 2D positional encoding was embedded into the backbone network to improve keypoint localization performance. Finally, a C3k2-DAttention module based on the deformable attention mechanism was designed to strengthen the network's global modeling and receptive field adjustment abilities. Experimental results show that, compared with the baseline model YOLO-Pose, YOLO-AirPose improves the precision of object detection by 3.0 percentage points and the precision, recall, and mAP@0.5 of pose estimation by 5.0, 4.6, and 6.8 percentage points, respectively, while maintaining low computational cost and a small number of parameters. These results indicate that the proposed algorithm mitigates the accuracy limitations of UAV aerial-view human pose estimation and adapts better to complex human poses.
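The abstract does not give the details of IPSFA, so the following is only a minimal sketch of the general idea behind a topology-aware horizontal flip: mirror the x-coordinates and then permute the left/right keypoints so that each index keeps its semantic meaning (for example, index 5 still denotes the left shoulder after the flip). The COCO 17-keypoint layout, the `FLIP_IDX` permutation, and the function names `flip_keypoints` and `flip_image` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

# Left/right permutation for the COCO 17-keypoint layout.
# ASSUMPTION: the abstract does not state which skeleton is used.
FLIP_IDX = np.array([0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 16, 15])


def flip_keypoints(kpts, img_width, flip_idx=FLIP_IDX):
    """Horizontally flip keypoints while keeping index semantics.

    kpts: array of shape (N, K, 3) holding (x, y, visibility) in pixels.
    Returns a new array of the same shape.
    """
    kpts = np.asarray(kpts, dtype=np.float32)
    flipped = kpts.copy()
    # Mirror x about the vertical image axis (continuous coordinates).
    flipped[..., 0] = img_width - flipped[..., 0]
    # Swap left/right joints so index k still names the same body part.
    return flipped[:, flip_idx, :]


def flip_image(img):
    """Horizontally flip an (H, W) or (H, W, C) image array."""
    return img[:, ::-1].copy()


if __name__ == "__main__":
    kpts = np.zeros((1, 17, 3), dtype=np.float32)
    kpts[0, 5] = (10.0, 20.0, 2.0)  # left shoulder of one person
    out = flip_keypoints(kpts, img_width=100)
    # The annotated joint now sits at the right-shoulder index (6).
    print(out[0, 6])
```

Without the permutation step, a mirrored person would carry a "left shoulder" label on what is visually the right shoulder, which is the kind of topology inconsistency that a keypoint-constrained flip is meant to avoid.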