Visual object tracking is an important task in computer vision, and a large number of tracking methods have been proposed in recent years in pursuit of high performance. Among them, Transformer-based methods have become a research hotspot because of their ability to model global dependencies and capture contextual information. First, existing Transformer-based visual object tracking methods were classified according to their network structures, the underlying principles and the key techniques for model improvement were described, and the advantages and disadvantages of the different network structures were summarized. Then, the experimental results of these methods on public datasets were compared to analyze the impact of network structure on performance. In this comparison, MixViT-L (ConvMAE) achieved tracking success rates of 73.3% and 86.1% on LaSOT and TrackingNet, respectively, indicating that object tracking methods based on the pure Transformer two-stage architecture offer better performance and broader development prospects. Finally, the limitations of these methods were summarized, including complex network structures, large numbers of parameters, high training requirements, and difficulty of deployment on edge devices, and future research directions were discussed: combining model compression, self-supervised learning, and Transformer interpretability analysis could yield more feasible solutions for Transformer-based visual object tracking.