Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (8): 2219-2224. DOI: 10.11772/j.issn.1001-9081.2019122139

• Artificial Intelligence •

  • Corresponding author: ZHANG Xuande (born 1979), male, from Guyuan, Ningxia, China; professor, Ph. D.; research interests: image quality assessment, image restoration. E-mail: zhangxuande@sust.edu.cn
  • About the author: LI Shengwu (born 1994), male, from Wuwei, Gansu, China; M. S. candidate; research interests: visual tracking, deep learning.

Multi-domain convolutional neural network based on self-attention mechanism for visual tracking

LI Shengwu, ZHANG Xuande   

  1. School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an Shaanxi 710021, China
  • Received:2019-12-23 Revised:2020-03-15 Online:2020-08-10 Published:2020-05-13
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61871260).


Abstract: To address the model drift of the Multi-Domain convolutional neural Network (MDNet) under fast target motion and drastic appearance change, a Multi-Domain convolutional neural Network based on Self-Attention (SAMDNet) was proposed, which introduces a self-attention mechanism to improve the tracking network along both the channel and spatial dimensions. First, a spatial attention module selectively aggregated a weighted sum of the features at all positions into every position of the feature map, so that similar features became related to each other. Then, a channel attention module integrated all feature maps to selectively emphasize the importance of interdependent channels. Finally, the two results were fused into the final feature map. In addition, to address the inaccurate classification of the network model caused by the many similar sequences with different attributes in MDNet's training data, a composite loss function was constructed from a classification loss and an instance discriminant loss. First, the classification loss function computed the classification loss value; second, the instance discriminant loss function increased the weight of the target in the current video sequence and suppressed its weight in the other sequences; lastly, the two losses were fused as the final loss of the model. Experiments were conducted on the widely used benchmark datasets OTB50 and OTB2015. The results show that the proposed algorithm improves the success rate by 1.6 and 1.4 percentage points respectively over MDNet, the winning algorithm of the Visual Object Tracking 2015 (VOT2015) challenge; it also exceeds the Continuous Convolution Operators for Visual Tracking (CCOT) algorithm in both precision and success rate, and surpasses the Efficient Convolution Operators (ECO) algorithm in precision on OTB50, which verifies its effectiveness.
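The spatial and channel attention modules described in the abstract can be illustrated with a minimal NumPy sketch. This is a hypothetical reconstruction, not the paper's implementation: the feature map is assumed to be flattened to shape (C, H*W), the two branches are run in parallel and fused by summation, and any learned projection layers the actual network may use are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(x):
    # x: (C, N) feature map flattened over N = H*W spatial positions.
    # Each position receives a weighted sum of the features at all
    # positions, so similar features reinforce each other.
    energy = x.T @ x                 # (N, N) pairwise position affinity
    attn = softmax(energy, axis=-1)  # each row sums to 1
    out = x @ attn.T                 # (C, N) aggregated features
    return out + x                   # residual connection

def channel_attention(x):
    # x: (C, N). Inter-channel affinity selectively emphasizes
    # interdependent channels by integrating all feature maps.
    energy = x @ x.T                 # (C, C) channel affinity
    attn = softmax(energy, axis=-1)
    out = attn @ x                   # (C, N) reweighted channels
    return out + x

# Toy example: C = 4 channels over a 3x3 spatial grid.
C, H, W = 4, 3, 3
x = np.random.randn(C, H * W)
fused = spatial_attention(x) + channel_attention(x)  # assumed fusion by sum
print(fused.shape)  # (4, 9)
```

The parallel-branch-plus-sum fusion follows common dual-attention designs; the paper's exact arrangement (serial vs. parallel, and how the fusion is weighted) may differ.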
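The composite loss can likewise be sketched in NumPy. This is a plausible reading of the abstract rather than the paper's published formulation: the instance discriminant term is modeled as a softmax of the positive-class score across domains (so the target scores high in its own sequence and low in the others), and the weighting coefficient `alpha` is an assumed hyperparameter.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def composite_loss(scores, labels, domain_idx, alpha=0.1):
    # scores: (B, D, 2) per-domain binary scores (background, target)
    # labels: (B,) 1 for target samples, 0 for background samples
    # domain_idx: index of the current video sequence (domain)
    # alpha: assumed weight balancing the two loss terms
    # Classification loss: cross-entropy within the current domain.
    logp_cls = log_softmax(scores[:, domain_idx, :], axis=-1)  # (B, 2)
    cls_loss = -logp_cls[np.arange(len(labels)), labels].mean()
    # Instance discriminant loss: for positive samples, softmax of the
    # target score across domains raises the target's weight in the
    # current sequence and suppresses it in the other sequences.
    pos = labels == 1
    logp_inst = log_softmax(scores[pos, :, 1], axis=-1)        # (P, D)
    inst_loss = -logp_inst[:, domain_idx].mean()
    # Fuse the two terms as the final loss of the model.
    return cls_loss + alpha * inst_loss

# Toy example: 8 samples, 4 training domains (video sequences).
rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 4, 2))
labels = np.array([1, 0, 1, 0, 1, 1, 0, 0])
loss = composite_loss(scores, labels, domain_idx=2)
print(loss > 0)  # True
```

This mirrors the structure the abstract describes (classification loss plus an instance term computed only on target samples); the actual network would compute `scores` from its domain-specific branches.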

Key words: Multi-Domain convolutional neural Network (MDNet), visual tracking, self-attention mechanism, instance discriminant loss, deep learning
