Weakly supervised action localization method based on attention mechanism

doi:10.11772/j.issn.1001-9081.2021030372

Abstract

Existing weakly supervised action localization methods cannot localize actions directly, and their localization accuracy is limited. To address this problem, a weakly supervised action localization method based on the attention mechanism was proposed, and an action localization model based on the pre- and post-frame information of action frames and a distinguishing function was designed and implemented. First, an attention value generation model based on a Conditional Variational AutoEncoder (CVAE) was used to generate frame-level attention values, which served as pseudo frame-level labels. Second, to strengthen the correlation between neighboring frames, the CVAE attention value generation model was improved by adding the pre- and post-frame information of each action frame when obtaining the frame-level attention values. Third, an attention value optimization model based on the distinguishing function was used to train and optimize the pseudo frame-level labels repeatedly. Experimental results on the THUMOS14 and ActivityNet1.2 datasets show that the action localization model based on the pre- and post-frame information of action frames and the distinguishing function achieves better localization performance and accuracy: its missed detection rate is 11.7% lower than that of the model without the pre- and post-frame information. Compared with AutoLoc, the Weakly-supervised Temporal Activity Localization and Classification framework (W-TALC), 3C-Net and other weakly supervised action localization models, when the Intersection over Union (IoU) threshold is set to 0.5, the mean Average Precision (mAP) is improved by more than 10.7% on the THUMOS14 dataset and by more than 8.8% on the ActivityNet1.2 dataset.
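In temporal action localization, mAP at a given IoU threshold conventionally treats a predicted segment as a correct detection only if its temporal overlap with a ground-truth segment of the same class reaches that threshold. The following is a minimal Python sketch of that overlap measure, not the authors' evaluation code; the names `temporal_iou` and `is_match` and the (start, end) tuple representation are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in the same time unit.

    Hypothetical helper for illustration; segments are assumed to satisfy
    start <= end.
    """
    p_start, p_end = pred
    g_start, g_end = gt
    # Length of the overlapping interval (zero if the segments are disjoint).
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    # Union = sum of both lengths minus the part counted twice.
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """True if the predicted segment overlaps the ground truth enough."""
    return temporal_iou(pred, gt) >= threshold
```

For example, a prediction of (0, 10) against a ground truth of (2, 10) overlaps for 8 of 10 time units and would count as a match at the 0.5 threshold, whereas (0, 10) against (5, 15) overlaps for only 5 of 15 and would not.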

Key words: weakly supervised, attention value, Conditional Variational AutoEncoder (CVAE), distinguishing function, action localization, mean Average Precision (mAP)

摘要：

针对弱监督动作定位方法无法直接进行动作定位且定位准确性不高的问题，提出了一种基于注意力机制的弱监督动作定位方法，并设计和实现了一种基于动作前后帧信息和区分函数的动作定位模型。采用条件变分自编码器（CVAE）注意力值生成模型，将生成的帧级注意力值作为伪帧级标签；为了增强帧前后的关联性，改进CVAE注意力值生成模型，加入动作前后帧信息以获取帧级注意力值；采用基于区分函数的注意力值优化模型，对伪帧级标签进行反复训练和优化。在THUMOS14和ActivityNet1.2数据集上进行的实验结果表明，基于动作前后帧信息和区分函数的动作定位模型具有较好的动作定位效果和准确性，相较于未加入动作前后帧信息的模型，动作漏检率减小了11.7%；与AutoLoc、W-TALC、3C-Net等弱监督动作定位模型对比，当交并比（IoU）取值0.5时，在THUMOS14数据集上平均检测精度均值（mAP）提升10.7%以上，在ActivityNet1.2数据集上mAP提升8.8%以上。

关键词: 弱监督, 注意力值, 条件变分自编码器, 区分函数, 动作定位, 平均检测精度均值

CLC Number:

TP391.4

Cong HU, Gang HUA. Weakly supervised action localization method based on attention mechanism[J]. Journal of Computer Applications, 2022, 42(3): 960-967.

胡聪, 华钢. 基于注意力机制的弱监督动作定位方法[J]. 计算机应用, 2022, 42(3): 960-967.

Fig. 1 CVAE generation model

Fig. 2 Flowchart of attention value generation and optimization

Fig. 3 Flowchart of proposed model

Tab. 1 Comparison of mAP values at IoU=0.5 for different β, α and latent space sizes on THUMOS14 dataset

$β$	mAP/%	$α$	mAP/%	Latent space size	mAP/%
0.05	28.6	5	28.8	8×8	26.1
0.10	29.2	6	29.5	64×64	28.9
0.20	29.9	7	29.9	128×128	29.9
0.30	27.8	8	29.6	256×256	29.0
		9	29.4	512×512	28.6

Tab. 2 Comparison of mAP values at IoU=0.5 for different γ1 and γ2 on THUMOS14 dataset

$γ_1$	mAP/%	$γ_2$ (RGB)	mAP/%	$γ_2$ (optical flow)	mAP/%
0.1	28.4	0.3	28.3	0.1	28.7
0.2	29.1	0.4	29.2	0.2	29.1
0.3	29.9	0.5	29.9	0.3	29.9
0.4	28.4	0.6	28.2	0.4	28.5

Tab. 3 Effect of adding pre- and post-frame information of action frames on the missed detection rate on THUMOS14 dataset

Pre- and post-frame information	Missed detection rate/%
Added	14.3
Not added	16.2
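
The 11.7% reduction in missed detection rate quoted in the abstract appears to be a relative reduction rather than a difference in percentage points. Under that reading, the figure can be reproduced from the two rates in Tab. 3 (14.3% with the pre- and post-frame information, 16.2% without):

```latex
\frac{16.2 - 14.3}{16.2} = \frac{1.9}{16.2} \approx 0.117 = 11.7\%
```

The absolute difference between the two rates is 1.9 percentage points.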

Tab. 4 Improvement of mAP from the distinguishing function on THUMOS14 dataset

IoU	mAP/% (with distinguishing function)	mAP/% (without distinguishing function)
0.1	59.8	56.3
0.2	54.5	51.2
0.3	47.7	44.6
0.4	39.3	36.3
0.5	29.9	26.9
0.6	20.9	18.5
0.7	12.0	10.6
0.8	3.7	2.8
0.9	0.4	0.3

Tab. 5 Comparison of mAP values (%) of different models at different IoU thresholds on THUMOS14 dataset

Model	Feature extraction	IoU=0.1	IoU=0.2	IoU=0.3	IoU=0.4	IoU=0.5	IoU=0.6	IoU=0.7	IoU=0.8	IoU=0.9
AutoLoc	UNT	—	—	35.8	29.0	21.2	13.4	5.8	—	—
STPN	I3D	52.0	44.7	35.5	25.8	16.9	9.9	4.3	1.2	0.1
W-TALC	I3D	55.2	49.6	40.1	31.1	22.8	—	7.6	—	—
3C-Net	I3D	56.8	49.8	40.9	32.3	24.6	—	7.7	—	—
BaS-Net	I3D	58.2	52.3	44.6	36.0	27.0	18.6	10.4	3.9	0.5
Proposed model	I3D	59.8	54.5	47.7	39.3	29.9	20.9	12.0	3.7	0.4

Tab. 6 Comparison of mAP values (%) of different models at different IoU thresholds on ActivityNet1.2 dataset

Model	Feature extraction	IoU=0.50	IoU=0.55	IoU=0.60	IoU=0.65	IoU=0.70	IoU=0.75	IoU=0.80	IoU=0.85	IoU=0.90	IoU=0.95
AutoLoc	UNT	27.3	24.9	22.5	19.9	17.5	15.1	13.0	10.0	6.8	3.3
TSM	I3D	28.3	26.0	23.6	21.2	18.9	17.0	14.0	11.1	7.5	3.5
BaS-Net	I3D	38.5	—	—	—	—	24.2	—	—	—	5.6
3C-Net	I3D	35.4	—	—	—	22.9	—	—	—	—	—
W-TALC	I3D	37.0	33.5	30.4	25.7	14.6	12.7	10.0	7.0	4.2	1.5
Proposed model	I3D	41.9	38.4	34.3	30.8	27.3	23.8	19.7	15.6	10.4	4.7
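
The "more than 10.7%" and "more than 8.8%" improvements quoted in the abstract seem to correspond to the relative mAP gain over the strongest baseline at IoU=0.5 in Tab. 5 and Tab. 6. The Python sketch below checks that reading using only values copied from the two tables; the helper name `smallest_relative_gain` and the dictionary layout are illustrative assumptions.

```python
# mAP/% at IoU=0.5, copied from Tab. 5 (THUMOS14) and Tab. 6 (ActivityNet1.2).
THUMOS14 = {"AutoLoc": 21.2, "STPN": 16.9, "W-TALC": 22.8,
            "3C-Net": 24.6, "BaS-Net": 27.0}
ACTIVITYNET12 = {"AutoLoc": 27.3, "TSM": 28.3, "BaS-Net": 38.5,
                 "3C-Net": 35.4, "W-TALC": 37.0}
PROPOSED = {"THUMOS14": 29.9, "ActivityNet1.2": 41.9}


def smallest_relative_gain(proposed, baselines):
    """Return (baseline name, relative gain) for the strongest baseline.

    The relative gain over any other baseline is at least this large.
    """
    name = max(baselines, key=baselines.get)
    return name, (proposed - baselines[name]) / baselines[name]


for dataset, baselines in (("THUMOS14", THUMOS14),
                           ("ActivityNet1.2", ACTIVITYNET12)):
    name, gain = smallest_relative_gain(PROPOSED[dataset], baselines)
    print(f"{dataset}: {gain:.1%} over {name}")
```

On both datasets the strongest listed baseline is BaS-Net, and the relative gains come out at about 10.7% and 8.8%; in absolute terms these are 2.9 and 3.4 percentage points of mAP.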
