Weakly supervised action localization based on action template matching

doi:10.11772/j.issn.1001-9081.2019010139

Abstract

To address action localization in video, a weakly supervised method based on template matching is proposed. First, several candidate bounding boxes for the position of the action subject are generated on each frame of the video, and these candidate boxes are linked in chronological order to form action proposals. Second, action templates are obtained from a subset of frames of the training videos. Finally, the model is trained with the action proposals and action templates to find the optimal model parameters. In experiments on the UCF-sports dataset, the proposed method improves action classification accuracy by 0.3 percentage points over the TLSVM (Transfer Latent Support Vector Machine) method and, at an overlap threshold of 0.2, improves action localization accuracy by 28.21 percentage points over the CRANE method. The experimental results show that the proposed method not only reduces the workload of dataset annotation but also improves the accuracy of both action classification and action localization.
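
The proposal-building step and the overlap-threshold criterion can be illustrated with a minimal sketch. The abstract does not state how candidate boxes are linked or how overlap is measured, so everything below is an assumption made for illustration: the greedy highest-overlap linking rule, the mean per-frame IoU criterion, and the names `iou`, `link_proposal`, and `is_localized` are all hypothetical and should not be read as the authors' implementation.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_proposal(frames: Sequence[Sequence[Box]], start: int) -> List[Box]:
    """Link one candidate box per frame in chronological order.

    Assumed rule (not from the paper): starting at frames[0][start], pick in
    each following frame the candidate that overlaps the previous box most.
    """
    tube = [frames[0][start]]
    for candidates in frames[1:]:
        tube.append(max(candidates, key=lambda c: iou(tube[-1], c)))
    return tube


def is_localized(pred: Sequence[Box], truth: Sequence[Box],
                 threshold: float = 0.2) -> bool:
    """Assumed criterion: mean per-frame IoU reaches the overlap threshold."""
    mean_iou = sum(iou(p, t) for p, t in zip(pred, truth)) / len(truth)
    return mean_iou >= threshold
```

Under these assumptions, each starting candidate in the first frame yields one action proposal, and a proposal counts as a correct localization when its mean overlap with the annotated boxes reaches the chosen threshold (0.2 in the reported comparison with CRANE).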

Key words: action localization, action template, weakly supervised, action proposal, video

CLC Number: TP391.4

SHI Xiangbin, ZHOU Jincheng, LIU Cuiwei. Weakly supervised action localization based on action template matching[J]. Journal of Computer Applications, 2019, 39(8): 2408-2413.

石祥滨, 周金成, 刘翠微. 基于动作模板匹配的弱监督动作定位[J]. 计算机应用, 2019, 39(8): 2408-2413.
