Abstract: To address the model drift problem of the Multi-Domain convolutional neural Network (MDNet) tracker when the target moves rapidly and its appearance changes drastically, a Multi-Domain convolutional neural Network based on Self-Attention (SAMDNet) was proposed, which introduces a self-attention mechanism to improve the tracking network along both the channel and the spatial dimensions. First, a spatial attention module updated every position in the feature map with a selectively weighted sum of the features at all positions, so that similar features were associated with one another regardless of their spatial distance. Then, a channel attention module aggregated all feature maps to selectively emphasize the importance of interdependent channels. Finally, the outputs of the two modules were fused to obtain the final feature map. In addition, to address the inaccurate classification caused by the many similar sequences with different attributes in the training data of the MDNet algorithm, a composite loss function was constructed from a classification loss function and an instance-discriminative loss function. First, the classification loss function was used to compute the classification loss value. Second, the instance-discriminative loss function was used to increase the weight of the target in the current video sequence while suppressing its weight in the other sequences. Lastly, the two losses were fused as the final loss of the model. Experiments were conducted on two widely used benchmark datasets, OTB50 and OTB2015. The results show that the proposed algorithm improves the success rate by 1.6 and 1.4 percentage points on OTB50 and OTB2015 respectively over MDNet, the winning algorithm of the Visual Object Tracking 2015 challenge (VOT2015). Both the precision and the success rate of the proposed algorithm exceed those of the Continuous Convolution Operators for Visual Tracking (CCOT) algorithm, and its precision on OTB50 also surpasses that of the Efficient Convolution Operators (ECO) algorithm, which verifies the effectiveness of the proposed algorithm.
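The dual-attention design described above can be made concrete with a short sketch. The following PyTorch code is a minimal illustration, not the paper's implementation: it assumes SAGAN/non-local-style position attention (1x1 query/key/value projections with a learned residual weight gamma) and Gram-matrix channel attention, and it assumes the fusion step is an element-wise sum, since the abstract does not specify the fusion operator or layer sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Position attention: every location is updated with a weighted sum of
    the features at all locations, so similar features reinforce each other."""
    def __init__(self, in_ch):
        super().__init__()
        self.query = nn.Conv2d(in_ch, in_ch // 8, kernel_size=1)
        self.key = nn.Conv2d(in_ch, in_ch // 8, kernel_size=1)
        self.value = nn.Conv2d(in_ch, in_ch, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)      # B x HW x C/8
        k = self.key(x).flatten(2)                        # B x C/8 x HW
        attn = F.softmax(q @ k, dim=-1)                   # B x HW x HW similarities
        v = self.value(x).flatten(2)                      # B x C x HW
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w) # weighted sum over positions
        return self.gamma * out + x

class ChannelAttention(nn.Module):
    """Channel attention: channel maps attend to one another, emphasizing
    interdependent channels; no projections, as in Gram-matrix attention."""
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        f = x.flatten(2)                                  # B x C x HW
        attn = F.softmax(f @ f.transpose(1, 2), dim=-1)   # B x C x C
        out = (attn @ f).view(b, c, h, w)                 # re-weighted channel maps
        return self.gamma * out + x

class SelfAttentionFusion(nn.Module):
    """Applies both branches to the same feature map and fuses by summation
    (the fusion operator is an assumption; the abstract only says "fused")."""
    def __init__(self, in_ch):
        super().__init__()
        self.spatial = SpatialAttention(in_ch)
        self.channel = ChannelAttention()

    def forward(self, x):
        return self.spatial(x) + self.channel(x)

# Example: MDNet's conv3 output is 512 x 3 x 3 per candidate patch.
feats = torch.randn(8, 512, 3, 3)
fused = SelfAttentionFusion(512)(feats)   # same shape as the input
```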
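The composite loss can be sketched in the same spirit. The abstract states only that a classification loss and an instance-discriminative loss are fused; the code below assumes MDNet-style per-domain binary (background/target) branches and a cross-entropy-over-domains instance term that raises the target's score in its own sequence while suppressing it in the others. The function name, the tensor shapes, and the fusion weight `alpha` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def composite_loss(domain_scores, labels, domain_idx, alpha=0.1):
    """domain_scores: (N, K, 2) background/target scores from the K
                      domain-specific branches for N candidate samples.
       labels:        (N,) long tensor, 0 = background, 1 = target.
       domain_idx:    index of the video sequence the batch came from.
       alpha:         weight of the instance term (assumed value)."""
    # 1) Classification loss in the current domain only, as in MDNet.
    cls_loss = F.cross_entropy(domain_scores[:, domain_idx, :], labels)

    # 2) Instance-discriminative loss: treat the K domains as K classes and,
    #    for positive samples, push the target score of the current sequence
    #    up while suppressing the target scores of all other sequences.
    pos = domain_scores[labels == 1, :, 1]                # (P, K) target scores
    if pos.numel() > 0:
        targets = pos.new_full((pos.size(0),), domain_idx, dtype=torch.long)
        inst_loss = F.cross_entropy(pos, targets)
    else:
        inst_loss = pos.sum()                             # zero without positives

    # 3) Fuse the two terms into the final training loss.
    return cls_loss + alpha * inst_loss
```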
References:
[1] HENRIQUES J F, CASEIRO R, MARTINS P, et al. High-speed tracking with kernelized correlation filters[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(3): 583-596.
[2] FAN J Q, SONG H H, ZHANG K H. Real-time visual tracking via channel stability weighted complementary learning[J]. Journal of Computer Applications, 2018, 38(6): 1751-1754.
[3] XIONG C Z, CHE M Q, WANG R L. Real-time visual tracking algorithm based on correlation filters and sparse convolutional features[J]. Journal of Computer Applications, 2018, 38(8): 2175-2179.
[4] SONG Y, MA C, WU X, et al. VITAL: visual tracking via adversarial learning[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 8990-8999.
[5] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]// Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2014: 2672-2680.
[6] FAN H, LING H. SANet: structure-aware network for visual tracking[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Piscataway: IEEE, 2017: 2217-2224.
[7] PINEDA F J. Generalization of back propagation to recurrent and higher order neural networks[C]// Proceedings of the 1987 International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 1987: 602-611.
[8] NAM H, BAEK M, HAN B. Modeling and propagating CNNs in a tree structure for visual tracking[EB/OL]. [2019-12-20]. https://arxiv.org/pdf/1608.07242.pdf.
[9] NAM H, HAN B. Learning multi-domain convolutional neural networks for visual tracking[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 4293-4302.
[10] KRISTAN M, MATAS J, LEONARDIS A, et al. The visual object tracking VOT2015 challenge results[C]// Proceedings of the 2015 IEEE International Conference on Computer Vision Workshop. Piscataway: IEEE, 2015: 564-586.
[11] GIRSHICK R, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]// Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2014: 580-587.
[12] RUSSAKOVSKY O, DENG J, SU H, et al. ImageNet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211-252.
[13] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010.
[14] ZHANG H, GOODFELLOW I J, METAXAS D N, et al. Self-attention generative adversarial networks[C]// Proceedings of the 36th International Conference on Machine Learning. New York: JMLR.org, 2019: 7354-7363.
[15] WANG X, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7794-7803.
[16] SUNG K K, POGGIO T. Example-based learning for view-based human face detection[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998, 20(1): 39-51.
[17] WU Y, LIM J, YANG M H. Online object tracking: a benchmark[C]// Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2013: 2411-2418.
[18] WU Y, LIM J, YANG M H. Object tracking benchmark[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1834-1848.
[19] DANELLJAN M, BHAT G, SHAHBAZ KHAN F, et al. ECO: efficient convolution operators for tracking[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6931-6939.
[20] DANELLJAN M, ROBINSON A, SHAHBAZ KHAN F, et al. Beyond correlation filters: learning continuous convolution operators for visual tracking[C]// Proceedings of the 14th European Conference on Computer Vision. Cham: Springer, 2016: 472-488.
[21] DANELLJAN M, HAGER G, SHAHBAZ KHAN F, et al. Adaptive decontamination of the training set: a unified formulation for discriminative visual tracking[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 1430-1438.
[22] HONG S, YOU T, KWAK S, et al. Online tracking by learning discriminative saliency map with convolutional neural network[C]// Proceedings of the 32nd International Conference on Machine Learning. New York: JMLR.org, 2015: 597-606.
[23] BERTINETTO L, VALMADRE J, HENRIQUES J F, et al. Fully-convolutional Siamese networks for object tracking[C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9914. Cham: Springer, 2016: 850-865.
[24] WANG Q, GAO J, XING J, et al. DCFNet: discriminant correlation filters network for visual tracking[EB/OL]. [2019-12-20]. https://arxiv.org/pdf/1704.04057v1.pdf.