
Weakly supervised action localization method based on attention mechanism

HU Cong, HUA Gang

  1. China University of Mining and Technology
  • Received: 2021-03-12 Revised: 2021-06-25 Online: 2021-08-31
  • Corresponding author: HU Cong

Abstract: To address the problem that weakly supervised action localization methods cannot localize actions directly and have limited localization accuracy, a weakly supervised action localization method based on the attention mechanism was proposed, and an action localization model based on the information of the frames before and after an action and a distinguishing function was designed and implemented. First, a Conditional Variational AutoEncoder (CVAE) attention value generation model was used to generate frame-level attention values, which served as pseudo frame-level labels. Second, to strengthen the correlation between neighboring frames, the CVAE attention value generation model was improved by adding the information of the frames before and after the action when computing the frame-level attention values. Finally, an attention value optimization model based on the distinguishing function was used to train and optimize the pseudo frame-level labels repeatedly. Experimental results on the THUMOS14 and ActivityNet1.2 datasets show that the action localization model based on the information of the frames before and after the action and the distinguishing function achieves good localization performance and accuracy: compared with the model without the information of the frames before and after the action, it reduces the missed-detection rate of actions by 11.7 percentage points. Compared with weakly supervised action localization models such as AutoLoc, W-TALC (Weakly-supervised Temporal Activity Localization and Classification framework) and 3C-Net, when the IoU (Intersection over Union) threshold is set to 0.5, the mean Average Precision (mAP) is improved by more than 10.7 percentage points on the THUMOS14 dataset and by more than 8.8 percentage points on the ActivityNet1.2 dataset.
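
The mAP figures above are reported at an IoU threshold of 0.5. As a minimal sketch of what that criterion usually means for temporal segments, the snippet below computes the IoU of two one-dimensional intervals and checks it against the threshold. The function names `temporal_iou` and `is_true_positive` and the (start, end) representation in seconds are assumptions made for illustration; this is not the authors' evaluation code.

```python
def temporal_iou(pred, gt):
    """IoU of two temporal segments, each given as (start, end).

    Assumption: both segments use the same time unit and start <= end.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap length is zero when the segments do not intersect.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """Hypothetical helper: a prediction matches when IoU >= threshold."""
    return temporal_iou(pred, gt) >= threshold
```

Under this criterion, a predicted segment (2, 10) matched against a ground-truth segment (0, 10) overlaps for 8 of 10 seconds and counts as correct at the 0.5 threshold, whereas (0, 10) against (5, 15) shares only 5 of 15 seconds and does not.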

CLC number: