Gaze estimation predicts 3D gaze directions from face images, in which the eye details most directly related to gaze are concentrated within the face region and strongly influence estimation accuracy. However, existing gaze estimation models tend to overlook these small-scale eye details and are easily overwhelmed by gaze-irrelevant information in image features. To address this, a model based on multi-scale aggregation and shared attention is proposed to enhance the representativeness of features. First, shunted self-attention is used to aggregate eye and face information at different scales within an image, guiding the model to learn the correlations between objects of different scales and thereby compensating for the omission of eye details. Second, shared attention is established to capture features shared across images, reducing the attention paid to gaze-irrelevant features. Finally, multi-scale aggregation and shared attention are combined to further improve the accuracy of gaze estimation. On the public datasets MPIIFaceGaze, Gaze360, Gaze360_Processed, and GAFA-Head, the average angular error of the proposed model is 5.74%, 4.09%, 4.82%, and 10.55% lower, respectively, than that of GazeTR (Gaze TRansformer). On the difficult back-to-camera images in Gaze360, the average angular error of the proposed model is 4.70% lower than that of GazeTR. The experimental results show that the proposed model effectively aggregates multi-scale gaze information with shared attention, improving the accuracy and robustness of gaze estimation.
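To make the multi-scale aggregation step concrete, the sketch below illustrates the general idea of shunted self-attention as described in the abstract: different groups of attention heads attend to keys/values pooled at different spatial rates, so coarse heads aggregate whole-face context while fine heads preserve eye-scale detail. This is a minimal illustration, not the authors' implementation; the dimensions, reduction ratios, pooling choice, and toy input are assumptions, and the shared-attention branch is omitted because its details are not given here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShuntedSelfAttention(nn.Module):
    """Self-attention whose head groups see keys/values pooled at different rates."""

    def __init__(self, dim=64, num_heads=4, sr_ratios=(4, 1)):
        super().__init__()
        assert dim % num_heads == 0 and num_heads % len(sr_ratios) == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.sr_ratios = sr_ratios
        self.heads_per_group = num_heads // len(sr_ratios)
        self.group_dim = self.heads_per_group * self.head_dim

        self.q = nn.Linear(dim, dim)
        # One key/value projection per scale group.
        self.kv = nn.ModuleList(
            [nn.Linear(dim, 2 * self.group_dim) for _ in sr_ratios]
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, hw):
        # x: (B, N, C) patch tokens of a face image; hw = (H, W) of the token grid.
        B, N, C = x.shape
        H, W = hw
        q = self.q(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        outputs = []
        for g, (r, kv_proj) in enumerate(zip(self.sr_ratios, self.kv)):
            if r > 1:
                # Coarse group: average-pool the token grid so keys/values
                # summarize large-scale (whole-face) context.
                feat = x.transpose(1, 2).reshape(B, C, H, W)
                feat = F.avg_pool2d(feat, kernel_size=r, stride=r)
                feat = feat.flatten(2).transpose(1, 2)      # (B, N/r^2, C)
            else:
                # Fine group: keep full resolution to preserve eye-scale detail.
                feat = x
            kv = kv_proj(feat).view(B, -1, 2, self.heads_per_group, self.head_dim)
            k, v = kv.permute(2, 0, 3, 1, 4)                # each (B, h_g, M, d)

            q_g = q[:, g * self.heads_per_group:(g + 1) * self.heads_per_group]
            attn = ((q_g @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
            outputs.append((attn @ v).transpose(1, 2).reshape(B, N, self.group_dim))

        # Concatenate the coarse and fine head groups: multi-scale aggregation.
        return self.proj(torch.cat(outputs, dim=-1))


if __name__ == "__main__":
    tokens = torch.randn(2, 14 * 14, 64)       # toy 14x14 patch grid, 64-dim tokens
    out = ShuntedSelfAttention()(tokens, (14, 14))
    print(out.shape)                           # torch.Size([2, 196, 64])
```

In this sketch the split into a coarse group and a fine group is what lets a single attention layer relate objects of very different sizes, such as the whole head and the eye regions, within one image.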