Journal of Computer Applications, 2024, Vol. 44, Issue (9): 2893-2902. DOI: 10.11772/j.issn.1001-9081.2024040425

• Multimedia computing and computer simulation •

Unsupervised person re-identification based on self-distilled vision Transformer

Jieru JIA1,2,3, Jianchao YANG1,2,3, Shuorui ZHANG1,2,3, Tao YAN1,2,4,5, Bin CHEN4,5

  1. Institute of Big Data Science and Industry, Shanxi University, Taiyuan Shanxi 030006, China
  2. School of Computer and Information Technology, Shanxi University, Taiyuan Shanxi 030006, China
  3. Engineering Research Center for Machine Vision and Data Mining of Shanxi Province (Shanxi University), Taiyuan Shanxi 030006, China
  4. Chongqing Research Institute of Harbin Institute of Technology, Chongqing 401151, China
  5. International Research Institute of Artificial Intelligence, Harbin Institute of Technology, Shenzhen Guangdong 518055, China
  • Received:2024-04-09 Revised:2024-06-18 Accepted:2024-06-26 Online:2024-09-14 Published:2024-09-10
  • Contact: Jieru JIA
  • About author: YANG Jianchao, born in 2001 in Yuncheng, Shanxi, M. S. candidate. His research interests include vision Transformer.
    ZHANG Shuorui, born in 1999 in Yuncheng, Shanxi, M. S. candidate. Her research interests include unsupervised learning.
    YAN Tao, born in 1987 in Dingxiang, Shanxi, Ph. D., associate professor. His research interests include 3D reconstruction.
    CHEN Bin, born in 1970 in Guanghan, Sichuan, Ph. D., professor. His research interests include computer vision.
  • Supported by:
    National Natural Science Foundation of China (62106133); Funds for Central-Government-Guided Local Science and Technology Development (YDZJSX20231C001)

Abstract:

Vision Transformer (ViT) lacks inductive bias, which makes it hard to learn meaningful visual representations on relatively small-scale person re-identification datasets. To address this problem, an unsupervised person re-identification method based on self-distilled vision Transformer was proposed. Firstly, exploiting the modular architecture of ViT, in which the features generated by every intermediate block have the same dimension, an intermediate Transformer block was selected randomly and its output was fed into a classifier to obtain prediction results. Secondly, by minimizing the Kullback-Leibler divergence between the output distribution of the randomly selected intermediate classifier and that of the final classifier, the classification predictions of the intermediate block were constrained to be consistent with those of the final classifier, and a self-distillation loss function was constructed on this basis. Finally, the model was optimized by jointly minimizing the cluster-level contrastive loss, instance-level contrastive loss, and self-distillation loss. In addition, by providing soft supervision from the final classifier to the intermediate block, inductive bias was introduced into the ViT model effectively, which helped the model learn more robust and generalizable visual representations. Compared with Transformer-based Object Re-IDentification Self-Supervised Learning (TransReID-SSL), the proposed method improves the mean Average Precision (mAP) and Rank-1 by 1.2 and 0.8 percentage points respectively on the Market-1501 dataset, and by 3.4 and 3.1 percentage points respectively on the MSMT17 dataset. Experimental results demonstrate that the proposed method can effectively improve the accuracy of unsupervised person re-identification.
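
The self-distillation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the temperature parameter, the weight `w_distill`, and the direction of the divergence (final classifier as the target distribution, a common choice in knowledge distillation) are all assumptions, since the abstract does not specify them. Plain Python lists stand in for the per-block classifier logits of a single sample.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def self_distillation_loss(block_logits, rng, temperature=1.0):
    """Pick one intermediate block at random and compare it with the final block.

    block_logits: list of classifier logits, one entry per Transformer block;
                  the last entry belongs to the final block.
    Returns (chosen block index, KL(final || intermediate)).
    The KL direction is an assumption, not taken from the paper.
    """
    final_probs = softmax(block_logits[-1], temperature)
    idx = rng.randrange(len(block_logits) - 1)  # never selects the final block
    inter_probs = softmax(block_logits[idx], temperature)
    return idx, kl_divergence(final_probs, inter_probs)


def total_loss(cluster_loss, instance_loss, distill_loss, w_distill=1.0):
    """Joint objective: the three losses summed (the weighting is hypothetical)."""
    return cluster_loss + instance_loss + w_distill * distill_loss


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy logits for three blocks over three pseudo-identity classes.
    logits = [[0.0, 0.0, 0.0], [1.0, 0.5, 0.0], [2.0, 1.0, 0.0]]
    idx, loss = self_distillation_loss(logits, rng)
    print("selected block:", idx, "self-distillation loss:", loss)
```

Because every ViT block emits features of the same dimension, a single classifier head can in principle score any block, which is what makes the random selection cheap. In a real training loop the final-classifier distribution would typically be treated as a fixed target when computing this term.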

Key words: person re-identification, unsupervised learning, Vision Transformer (ViT), knowledge distillation, feature representation
