Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (5): 1644-1654. DOI: 10.11772/j.issn.1001-9081.2023060796

• Multimedia Computing and Computer Simulation •

Survey of visual object tracking methods based on Transformer

Ziwen SUN1, Lizhi QIAN1, Chuandong YANG1, Yibo GAO1, Qingyang LU2, Guanglin YUAN2

1. Laboratory of Guidance Control and Information Perception Technology of High Overload Projectiles, Army Academy of Artillery and Air Defense, Hefei Anhui 230031, China
    2. Department of Information Engineering, Army Academy of Artillery and Air Defense, Hefei Anhui 230031, China
  • Received:2023-06-21 Revised:2023-09-04 Accepted:2023-09-11 Online:2023-10-27 Published:2024-05-10
  • Contact: Guanglin YUAN
  • About author:SUN Ziwen, born in 1996, Ph. D. candidate. His research interests include computer vision.
    QIAN Lizhi, born in 1963, Ph. D., professor. His research interests include intelligent ammunition strike.
    YANG Chuandong, born in 1994, Ph. D. candidate. His research interests include computer vision.
    GAO Yibo, born in 1998, M. S. His research interests include machine learning, hardware programmable languages.
LU Qingyang, born in 1994, M. S. candidate. His research interests include natural language processing, object tracking.
    YUAN Guanglin (corresponding author), born in 1976, Ph. D., associate professor. His research interests include computer vision, machine learning.
  • Supported by:
Military Model Project (LZX20190112)

Abstract:

Visual object tracking is one of the important tasks in computer vision. To achieve high-performance object tracking, a large number of object tracking methods have been proposed in recent years, among which Transformer-based methods have become a hot topic in the field of visual object tracking because of their ability to perform global modeling and capture contextual information. Firstly, existing Transformer-based visual object tracking methods were classified according to their network structures, the underlying principles and key techniques for model improvement were outlined, and the advantages and disadvantages of the different network structures were summarized. Then, the experimental results of these methods on public datasets were compared to analyze the influence of network structure on performance: MixViT-L (ConvMAE) achieved tracking success rates of 73.3% and 86.1% on LaSOT and TrackingNet respectively, indicating that object tracking methods based on a pure Transformer two-stage architecture offer better performance and broader development prospects. Finally, the current limitations of these methods, such as complex network structures, large numbers of parameters, demanding training requirements, and difficulty of deployment on edge devices, were summarized, and future research priorities were discussed: combining Transformer-based visual object tracking with model compression, self-supervised learning, and Transformer interpretability analysis may yield more feasible solutions.
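To make the global modeling described above concrete, the sketch below shows the template-search cross-attention that many Transformer trackers build on: every search-region token attends to every template token in a single step, unlike the local receptive field of a convolution. It is a minimal illustration under assumed names (TemplateSearchAttention, embed_dim, the token counts), not the implementation of any method surveyed here.

```python
# Minimal sketch of template-search cross-attention in a Transformer tracker.
# Illustrative only: class and parameter names are assumptions, not taken
# from any surveyed method.
import torch
import torch.nn as nn

class TemplateSearchAttention(nn.Module):
    """Search-region tokens (queries) attend to template tokens (keys/values)."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, search: torch.Tensor, template: torch.Tensor) -> torch.Tensor:
        # Each search token is matched against *all* template tokens at once,
        # giving a global receptive field rather than a local convolutional one.
        fused, _ = self.attn(query=search, key=template, value=template)
        return self.norm(search + fused)  # residual connection + LayerNorm

# Toy usage: one image pair, 64 template tokens and 256 search tokens, width 256.
template = torch.randn(1, 64, 256)
search = torch.randn(1, 256, 256)
print(TemplateSearchAttention()(search, template).shape)  # torch.Size([1, 256, 256])
```

How such attention blocks are arranged, for example as a separate fusion stage after Siamese feature extraction, or as joint self-attention over mixed template and search tokens in one-stream trackers, is precisely the network-structure distinction on which this survey's classification is based.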

Key words: computer vision, object tracking, hybrid network structure, deep learning, Siamese network, Transformer

CLC number: