Journal of Computer Applications, 2022, Vol. 42, Issue (3): 960-967. DOI: 10.11772/j.issn.1001-9081.2021030372

• Multimedia computing and computer simulation •

Weakly supervised action localization method based on attention mechanism

Cong HU, Gang HUA

  1. College of Information and Control Engineering, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China
  • Received: 2021-03-12 Revised: 2021-06-22 Accepted: 2021-06-28 Online: 2022-04-09 Published: 2022-03-10
  • Contact: Gang HUA
  • About author: HU Cong, born in 1995, male (Hui ethnicity), a native of Xuzhou, Jiangsu, M.S. candidate. His research interests include computer vision understanding and deep learning.

Abstract:

To address the problem that weakly supervised action localization methods cannot localize actions directly and have low localization accuracy, a weakly supervised action localization method based on an attention mechanism was proposed, and an action localization model based on the preceding and following frame information of action frames and a distinguishing function was designed and implemented. A Conditional Variational AutoEncoder (CVAE) attention value generation model was used, with the generated frame-level attention values serving as pseudo frame-level labels. To strengthen the correlation between neighboring frames, the CVAE attention value generation model was improved by adding the preceding and following frame information of action frames when obtaining the frame-level attention values. An attention value optimization model based on the distinguishing function was then used to train and optimize the pseudo frame-level labels repeatedly. Experimental results on the THUMOS14 and ActivityNet1.2 datasets show that the action localization model based on preceding and following frame information and the distinguishing function achieves good localization performance and accuracy. Compared with the model without preceding and following frame information, the missed detection rate is reduced by 11.7%. Compared with weakly supervised action localization models such as AutoLoc, the Weakly-supervised Temporal Activity Localization and Classification framework (W-TALC) and 3C-Net, at an Intersection over Union (IoU) threshold of 0.5, the mean Average Precision (mAP) is improved by more than 10.7% on the THUMOS14 dataset and by more than 8.8% on the ActivityNet1.2 dataset.

Key words: weakly supervised, attention value, Conditional Variational AutoEncoder (CVAE), distinguishing function, action localization, mean Average Precision (mAP)
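
The IoU criterion cited in the abstract can be illustrated with a minimal sketch. It assumes that predicted and ground-truth actions are given as (start, end) segments in seconds, and that a prediction counts as correct when its temporal IoU with a ground-truth segment is at least the threshold. The function names `temporal_iou` and `is_correct_detection` are hypothetical and are not taken from the paper, whose evaluation code may differ.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments, assumed to be in seconds."""
    # Length of the overlap; zero when the segments are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union = sum of both lengths minus the overlap counted twice.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred, gt, threshold=0.5):
    """Assumed matching rule: IoU at or above the threshold counts as a hit."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    # Illustrative segments only, not data from the paper.
    print(temporal_iou((0.0, 10.0), (2.0, 10.0)))
    print(is_correct_detection((2.0, 6.0), (4.0, 8.0)))
```

Under this rule, a prediction that covers only a small part of an action, or that extends far beyond it, fails the 0.5 threshold even though it overlaps the ground truth. This is why mAP at IoU 0.5 is sensitive to boundary quality as well as to classification.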

CLC Number: