Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (5): 1644-1654. DOI: 10.11772/j.issn.1001-9081.2023060796

• Multimedia Computing and Computer Simulation •

Survey of visual object tracking methods based on Transformer

Ziwen SUN1, Lizhi QIAN1, Chuandong YANG1, Yibo GAO1, Qingyang LU2, Guanglin YUAN2

1. Laboratory of Guidance Control and Information Perception Technology of High Overload Projectiles, Army Academy of Artillery and Air Defense, Hefei Anhui 230031, China
    2. Department of Information Engineering, Army Academy of Artillery and Air Defense, Hefei Anhui 230031, China
  • Received:2023-06-21 Revised:2023-09-04 Accepted:2023-09-11 Online:2023-10-27 Published:2024-05-10
  • Contact: Guanglin YUAN
  • About author:SUN Ziwen, born in 1996, Ph. D. candidate. His research interests include computer vision.
    QIAN Lizhi, born in 1963, Ph. D., professor. His research interests include intelligent ammunition strike.
    YANG Chuandong, born in 1994, Ph. D. candidate. His research interests include computer vision.
    GAO Yibo, born in 1998, M. S. His research interests include machine learning, hardware programmable languages.
LU Qingyang, born in 1994, M. S. candidate. His research interests include natural language processing, object tracking.
    YUAN Guanglin (corresponding author), born in 1976, Ph. D., associate professor. His research interests include computer vision, machine learning.
  • Supported by:
Military Model Project (LZX20190112)

Abstract:

Visual object tracking is one of the important tasks in computer vision. To achieve high-performance object tracking, a large number of object tracking methods have been proposed in recent years, among which Transformer-based methods have become a hot topic in the field of visual object tracking because of their ability to perform global modeling and capture contextual information. Firstly, existing Transformer-based visual object tracking methods were classified according to their network structures, the underlying principles and key techniques for model improvement were outlined, and the advantages and disadvantages of the different network structures were summarized. Then, the experimental results of these methods on public datasets were compared to analyze the influence of network structure on performance: MixViT-L (ConvMAE) achieved tracking success rates of 73.3% and 86.1% on LaSOT and TrackingNet respectively, indicating that object tracking methods based on a pure Transformer two-stage architecture offer better performance and broader development prospects. Finally, the current limitations of these methods, such as complex network structures, large numbers of parameters, demanding training requirements, and difficulty of deployment on edge devices, were summarized, and future research priorities were discussed: combining Transformer-based visual object tracking with model compression, self-supervised learning, and Transformer interpretability analysis may yield more feasible solutions.
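To make the global modeling described above concrete, the sketch below shows the template-search cross-attention that many Transformer trackers build on: every search-region token attends to every template token in a single step, unlike the local receptive field of a convolution. It is a minimal illustration under assumed names (TemplateSearchAttention, embed_dim, the token counts), not the implementation of any method surveyed here.

```python
# Minimal sketch of template-search cross-attention in a Transformer tracker.
# Illustrative only: class and parameter names are assumptions, not taken
# from any surveyed method.
import torch
import torch.nn as nn

class TemplateSearchAttention(nn.Module):
    """Search-region tokens (queries) attend to template tokens (keys/values)."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, search: torch.Tensor, template: torch.Tensor) -> torch.Tensor:
        # Each search token is matched against *all* template tokens at once,
        # giving a global receptive field rather than a local convolutional one.
        fused, _ = self.attn(query=search, key=template, value=template)
        return self.norm(search + fused)  # residual connection + LayerNorm

# Toy usage: one image pair, 64 template tokens and 256 search tokens, width 256.
template = torch.randn(1, 64, 256)
search = torch.randn(1, 256, 256)
print(TemplateSearchAttention()(search, template).shape)  # torch.Size([1, 256, 256])
```

How such attention blocks are arranged, for example as a separate fusion stage after Siamese feature extraction, or as joint self-attention over mixed template and search tokens in one-stream trackers, is precisely the network-structure distinction on which this survey's classification is based.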

Key words: computer vision, object tracking, hybrid network structure, deep learning, Siamese network, Transformer

CLC number: