Since the lack of inductive bias in Vision Transformer (ViT) makes it difficult to learn meaningful visual representations on relatively small-scale datasets, an unsupervised person re-identification method based on a self-distilled Vision Transformer was proposed. Firstly, because the modular architecture of ViT means that the features produced by every intermediate block share the same dimension, an intermediate Transformer block was selected at random and its output was fed into the classifier to obtain prediction results. Secondly, the Kullback-Leibler divergence between the output distributions of the randomly selected intermediate classifier and the final classifier was minimized, so that the classification predictions of the intermediate block were constrained to be consistent with those of the final classifier, and a self-distillation loss function was constructed on this basis. Finally, the model was optimized by jointly minimizing the cluster-level contrastive loss, the instance-level contrastive loss, and the self-distillation loss. By providing soft supervision from the final classifier to the intermediate block, inductive bias was introduced into the ViT model effectively, enabling the model to learn more robust and generalizable visual representations. Compared with Transformer-based Object Re-IDentification Self-Supervised Learning (TransReID-SSL), the proposed method improves the mean Average Precision (mAP) and Rank-1 by 1.2 and 0.8 percentage points respectively on the Market-1501 dataset, and by 3.4 and 3.1 percentage points respectively on the MSMT17 dataset. Experimental results demonstrate that the proposed method can effectively improve the accuracy of unsupervised person re-identification.
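
The self-distillation constraint summarized above can be illustrated with a minimal PyTorch-style sketch. The function names, the temperature, and the loss weight below are illustrative assumptions rather than the paper's actual implementation; only the overall idea follows the abstract: minimize the KL divergence from the intermediate classifier's output distribution to the (soft-target) final classifier's output distribution, and combine it with the cluster-level and instance-level contrastive losses.

```python
import torch
import torch.nn.functional as F


def self_distillation_loss(intermediate_logits: torch.Tensor,
                           final_logits: torch.Tensor,
                           temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between the intermediate-block classifier output and the
    final classifier output distribution (assumed temperature-softened logits)."""
    log_p_intermediate = F.log_softmax(intermediate_logits / temperature, dim=1)
    # The final classifier provides soft supervision, so its targets are detached.
    p_final = F.softmax(final_logits.detach() / temperature, dim=1)
    return F.kl_div(log_p_intermediate, p_final, reduction="batchmean") * temperature ** 2


def total_loss(cluster_contrastive: torch.Tensor,
               instance_contrastive: torch.Tensor,
               intermediate_logits: torch.Tensor,
               final_logits: torch.Tensor,
               distill_weight: float = 1.0) -> torch.Tensor:
    """Joint objective: cluster-level contrastive loss + instance-level contrastive
    loss + self-distillation loss (the weighting is an assumption)."""
    return (cluster_contrastive
            + instance_contrastive
            + distill_weight * self_distillation_loss(intermediate_logits, final_logits))
```

In this sketch, `intermediate_logits` would come from a randomly selected intermediate Transformer block passed through the classifier head, while `final_logits` come from the final block; detaching the final-classifier targets keeps the gradient of the self-distillation term flowing only into the intermediate branch.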