Existing image aesthetic quality evaluation methods widely use Convolutional Neural Networks (CNNs) to extract image features. Limited by their local receptive fields, CNNs have difficulty extracting global features from an image, so aesthetic attributes such as global composition relations and global color matching are lost. To solve this problem, an image aesthetic quality evaluation method based on the SSViT (Self-Supervised Vision Transformer) model was proposed. The self-attention mechanism was used to establish long-distance dependencies among local patches of the image and to adaptively learn their correlations, thereby extracting global features that characterize the aesthetic attributes. In addition, three aesthetic-quality-aware tasks, namely classifying image degradation, ranking image aesthetic quality, and reconstructing image semantics, were designed to pre-train the vision Transformer in a self-supervised manner on unlabeled image data, so as to enhance the representation of global features. Experimental results on the AVA (Aesthetic Visual Analysis) dataset show that the SSViT model achieves an evaluation accuracy of 83.28%, a Pearson Linear Correlation Coefficient (PLCC) of 0.7634, and a Spearman Rank-order Correlation Coefficient (SRCC) of 0.7462. These results demonstrate that the SSViT model achieves higher accuracy in image aesthetic quality evaluation.
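For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how evaluation accuracy, PLCC, and SRCC could be computed from predicted and ground-truth aesthetic scores. It is not the authors' evaluation code. The function names are hypothetical, and the threshold of 5.0 for separating high-quality from low-quality images is an assumption based on the convention commonly used with AVA's 1 to 10 score scale; the abstract does not state it.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    # Replace the ranks inside each group of tied values with their mean.
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def plcc(pred, target):
    """Pearson linear correlation between predicted and true scores."""
    return float(np.corrcoef(pred, target)[0, 1])


def srcc(pred, target):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(pred), average_ranks(target))


def binary_accuracy(pred, target, threshold=5.0):
    """Fraction of images whose high/low label matches the ground truth.

    ASSUMPTION: an image counts as high quality when its mean score
    exceeds `threshold`; 5.0 is a common AVA convention, not a value
    taken from the abstract.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred > threshold) == (target > threshold)))


if __name__ == "__main__":
    # Made-up scores for illustration only; these are not AVA data.
    predicted = [4.0, 6.0, 7.0, 3.0]
    ground_truth = [5.5, 6.0, 4.0, 2.0]
    print("accuracy:", binary_accuracy(predicted, ground_truth))
    print("PLCC:", plcc(predicted, ground_truth))
    print("SRCC:", srcc(predicted, ground_truth))
```

PLCC measures how linearly the predicted scores track the ground truth, while SRCC depends only on their ordering, which is why the two values can differ for the same model.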