Multi-domain convolutional neural network based on self-attention mechanism for visual tracking

doi:10.11772/j.issn.1001-9081.2019122139

Abstract

To address the model drift that the Multi-Domain convolutional neural Network (MDNet) suffers when the target moves rapidly and its appearance changes drastically, a Multi-Domain convolutional neural Network based on Self-Attention (SAMDNet) was proposed, which introduces a self-attention mechanism to improve the tracking network along both the channel and spatial dimensions. First, a spatial attention module selectively aggregated, at every position of the feature map, a weighted sum of the features at all positions, so that similar features became related to one another. Then, a channel attention module aggregated all feature maps to selectively emphasize the importance of interrelated channels. Finally, the final feature map was obtained by fusion. In addition, the training data of the MDNet algorithm contain many sequences that are similar but have different attributes, which makes the classification of the network model inaccurate; to solve this problem, a composite loss function was constructed from a classification loss function and an instance discriminant loss function. First, the classification loss function was used to calculate the classification loss value. Second, the instance discriminant loss function was used to increase the weight of the target in the current video sequence and suppress its weight in other sequences. Lastly, the two losses were fused as the final loss of the model. Experiments were conducted on two widely used benchmark datasets, OTB50 and OTB2015. Compared with MDNet, the champion algorithm of the 2015 Visual-Object-Tracking challenge (VOT2015), the proposed algorithm improves the success rate by 1.6 and 1.4 percentage points on OTB50 and OTB2015, respectively. The results also show that its precision rate and success rate exceed those of the Continuous Convolution Operators for Visual Tracking (CCOT) algorithm, and that its precision rate on OTB50 is higher than that of the Efficient Convolution Operators (ECO) algorithm, which verifies the effectiveness of the proposed algorithm.
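The pipeline described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the dot-product similarity, the residual fusion with hypothetical weights `gamma_s` and `gamma_c`, the cross-domain softmax used for the instance discriminant term, and the balance factor `alpha` are all assumptions chosen for illustration. Learned projection layers, which a real network would normally include, are omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def spatial_attention(feat):
    """feat is C x N (channels x flattened positions).
    Each output position i is a weighted sum of the features at all positions j."""
    energy = matmul(transpose(feat), feat)    # N x N position similarities
    attn = [softmax(row) for row in energy]   # row i: weights over positions j
    return matmul(feat, transpose(attn))      # C x N

def channel_attention(feat):
    """Each output channel is a weighted sum of all channel maps."""
    energy = matmul(feat, transpose(feat))    # C x C channel similarities
    attn = [softmax(row) for row in energy]   # row c: weights over channels k
    return matmul(attn, feat)                 # C x N

def fuse(feat, gamma_s=1.0, gamma_c=1.0):
    """Assumed fusion: residual sum of the input and both attention outputs."""
    sp = spatial_attention(feat)
    ch = channel_attention(feat)
    return [[f + gamma_s * s + gamma_c * c for f, s, c in zip(fr, sr, cr)]
            for fr, sr, cr in zip(feat, sp, ch)]

def composite_loss(scores, domain, alpha=0.1):
    """scores[d] = (negative_score, positive_score) that domain branch d
    assigns to one target sample; `domain` indexes the current sequence.
    alpha is a hypothetical weight balancing the two terms."""
    # Classification term: binary cross-entropy within the current domain.
    cls = -math.log(softmax(list(scores[domain]))[1])
    # Instance term (assumed form): softmax of the positive scores across
    # domains, rewarding a high score in the current sequence only.
    inst = -math.log(softmax([s[1] for s in scores])[domain])
    return cls + alpha * inst

if __name__ == "__main__":
    feature_map = [[1.0, 0.5, 0.0], [0.2, 0.4, 0.6]]   # 2 channels, 3 positions
    print(fuse(feature_map))
    print(composite_loss([(0.0, 2.0), (0.0, 0.5)], domain=0))
```

In this sketch the instance term grows when another sequence's branch scores the target higher than the current branch does, which matches the stated goal of raising the target's weight in the current sequence and suppressing it elsewhere.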

Key words: Multi-Domain convolutional neural Network (MDNet), visual tracking, self-attention mechanism, instance discriminant loss, deep learning

CLC Number: TP391.4

LI Shengwu, ZHANG Xuande. Multi-domain convolutional neural network based on self-attention mechanism for visual tracking[J]. Journal of Computer Applications, 2020, 40(8): 2219-2224.

李生武, 张选德. 基于自注意力机制的多域卷积神经网络的视觉追踪[J]. 计算机应用, 2020, 40(8): 2219-2224.
