Weakly supervised action localization method based on attention mechanism

doi:10.11772/j.issn.1001-9081.2021030372

Abstract:

To address the problem that weakly supervised action localization methods cannot localize actions directly and have low localization accuracy, a weakly supervised action localization method based on an attention mechanism was proposed, and an action localization model based on the pre-frame and post-frame information of action frames and a distinguishing function was designed and implemented. A Conditional Variational AutoEncoder (CVAE) attention value generation model was used to generate frame-level attention values, which served as pseudo frame-level labels. To strengthen the correlation between a frame and its neighbors, the CVAE attention value generation model was improved by adding the pre-frame and post-frame information of the action frame when obtaining the frame-level attention values. An attention value optimization model based on the distinguishing function was then used to train and optimize the pseudo frame-level labels repeatedly. Experimental results on the THUMOS14 and ActivityNet1.2 datasets show that the action localization model based on pre-frame and post-frame information and the distinguishing function achieves good localization performance and accuracy. Compared with the model without pre-frame and post-frame information, the missed detection rate is reduced by 11.7%. Compared with weakly supervised action localization models such as AutoLoc, Weakly-supervised Temporal Activity Localization and Classification framework (W-TALC) and 3C-Net, at an Intersection over Union (IoU) threshold of 0.5, the mean Average Precision (mAP) is improved by more than 10.7% on the THUMOS14 dataset and by more than 8.8% on the ActivityNet1.2 dataset.

Key words: weakly supervised, attention value, Conditional Variational AutoEncoder (CVAE), distinguishing function, action localization, mean Average Precision (mAP)

CLC number:

TP391.4

Cong HU, Gang HUA. Weakly supervised action localization method based on attention mechanism[J]. Journal of Computer Applications, 2022, 42(3): 960-967.

Figures and tables (9)

Fig. 1 CVAE generation mode

Fig. 2 Flowchart of attention value generation and optimization

Fig. 3 Flowchart of proposed model

Tab. 1 Comparison of mAP values at IoU=0.5 using different β, α and latent space sizes on THUMOS14 dataset

$β$	mAP/%	$α$	mAP/%	Latent space size	mAP/%
0.05	28.6	5	28.8	8×8	26.1
0.10	29.2	6	29.5	64×64	28.9
0.20	29.9	7	29.9	128×128	29.9
0.30	27.8	8	29.6	256×256	29.0
		9	29.4	512×512	28.6

Tab. 2 Comparison of mAP values at IoU=0.5 using different γ1 and γ2 on THUMOS14 dataset

$γ_1$	mAP/%	$γ_2$ (RGB)	mAP/%	$γ_2$ (optical flow)	mAP/%
0.1	28.4	0.3	28.3	0.1	28.7
0.2	29.1	0.4	29.2	0.2	29.1
0.3	29.9	0.5	29.9	0.3	29.9
0.4	28.4	0.6	28.2	0.4	28.5

Tab. 3 Improvement of mAP value from adding pre-frame and post-frame information of action frames on THUMOS14 dataset

Pre-frame and post-frame information	Missed detection rate/%
Added	14.3
Not added	16.2
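
The two missed detection rates in Tab. 3 match the 11.7% reduction quoted in the abstract if that figure is read as a relative reduction. This reading is an inference from the numbers, not something the table states. Under that assumption:

```latex
\[
\frac{16.2 - 14.3}{16.2} = \frac{1.9}{16.2} \approx 0.117 = 11.7\%
\]
```

In absolute terms the difference is 1.9 percentage points.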

Tab. 4 Improvement of mAP from the distinguishing function on THUMOS14 dataset

IoU	mAP/% with distinguishing function	mAP/% without distinguishing function
0.1	59.8	56.3
0.2	54.5	51.2
0.3	47.7	44.6
0.4	39.3	36.3
0.5	29.9	26.9
0.6	20.9	18.5
0.7	12.0	10.6
0.8	3.7	2.8
0.9	0.4	0.3
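
The IoU thresholds in these tables refer to temporal overlap between a predicted action segment and a ground-truth segment. The sketch below shows the usual interval-based definition and how a threshold turns it into a match decision. The function names and example segments are hypothetical, and the source does not describe its evaluation code, so treat this as an illustration of the standard metric rather than the authors' implementation. Whether the comparison is strict or inclusive also varies between evaluation toolkits.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start_time, end_time) in seconds


def temporal_iou(pred: Segment, truth: Segment) -> float:
    """Intersection over Union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred: Segment, truth: Segment, threshold: float) -> bool:
    """A prediction matches when its temporal IoU reaches the threshold
    (inclusive comparison assumed here)."""
    return temporal_iou(pred, truth) >= threshold


# Hypothetical segments: prediction 2.0-7.0 s, ground truth 3.0-8.0 s.
# The overlap is 4 s and the union is 6 s, so the IoU is 4/6.
example_iou = temporal_iou((2.0, 7.0), (3.0, 8.0))
print(round(example_iou, 3))                 # 0.667
print(is_hit((2.0, 7.0), (3.0, 8.0), 0.5))   # True
print(is_hit((2.0, 7.0), (3.0, 8.0), 0.7))   # False
```

The same prediction counts as correct at IoU=0.5 but not at IoU=0.7, which is why mAP falls as the threshold rises.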

Tab. 5 Comparison of mAP values (%) of different models at different IoU thresholds on THUMOS14 dataset

Model	Feature extraction	IoU=0.1	IoU=0.2	IoU=0.3	IoU=0.4	IoU=0.5	IoU=0.6	IoU=0.7	IoU=0.8	IoU=0.9
AutoLoc	UNT	—	—	35.8	29.0	21.2	13.4	5.8	—	—
STPN	I3D	52.0	44.7	35.5	25.8	16.9	9.9	4.3	1.2	0.1
W-TALC	I3D	55.2	49.6	40.1	31.1	22.8	—	7.6	—	—
3C-Net	I3D	56.8	49.8	40.9	32.3	24.6	—	7.7	—	—
BaS-Net	I3D	58.2	52.3	44.6	36.0	27.0	18.6	10.4	3.9	0.5
Proposed model	I3D	59.8	54.5	47.7	39.3	29.9	20.9	12.0	3.7	0.4
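
The abstract's "more than 10.7%" on THUMOS14 appears to be a relative improvement at IoU=0.5 over the strongest baseline in Tab. 5. That reading is inferred from the table values and is not stated in the source. A short check, using only numbers copied from the table (the helper name `relative_gain` is illustrative):

```python
# mAP (%) at IoU=0.5 on THUMOS14, copied from Tab. 5.
map_at_05 = {
    "AutoLoc": 21.2,
    "STPN": 16.9,
    "W-TALC": 22.8,
    "3C-Net": 24.6,
    "BaS-Net": 27.0,
}
proposed = 29.9  # proposed model, same column


def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


gains = {name: round(relative_gain(proposed, value), 1)
         for name, value in map_at_05.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")

# The smallest gain is the lower bound that "more than X%" would refer to.
smallest = min(gains, key=gains.get)
print(smallest, gains[smallest])  # BaS-Net 10.7
```

In absolute terms the gap to BaS-Net at IoU=0.5 is 2.9 percentage points.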

Tab. 6 Comparison of mAP values (%) of different models at different IoU thresholds on ActivityNet1.2 dataset

Model	Feature extraction	IoU=0.50	IoU=0.55	IoU=0.60	IoU=0.65	IoU=0.70	IoU=0.75	IoU=0.80	IoU=0.85	IoU=0.90	IoU=0.95
AutoLoc	UNT	27.3	24.9	22.5	19.9	17.5	15.1	13.0	10.0	6.8	3.3
TSM	I3D	28.3	26.0	23.6	21.2	18.9	17.0	14.0	11.1	7.5	3.5
BaS-Net	I3D	38.5	—	—	—	—	24.2	—	—	—	5.6
3C-Net	I3D	35.4	—	—	—	22.9	—	—	—	—	—
W-TALC	I3D	37.0	33.5	30.4	25.7	14.6	12.7	10.0	7.0	4.2	1.5
Proposed model	I3D	41.9	38.4	34.3	30.8	27.3	23.8	19.7	15.6	10.4	4.7
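
ActivityNet results are often summarized as the mean of the mAP values over the ten IoU thresholds from 0.50 to 0.95. Tab. 6 does not report that average, so the figures below are derived here from the table rows and are not results given in the source. Only models with a value at every threshold are included. The last lines check the abstract's "more than 8.8%" under the assumption that it is a relative gain at IoU=0.5 over BaS-Net.

```python
# mAP (%) at IoU = 0.50, 0.55, ..., 0.95 on ActivityNet1.2, copied from Tab. 6.
rows = {
    "AutoLoc": [27.3, 24.9, 22.5, 19.9, 17.5, 15.1, 13.0, 10.0, 6.8, 3.3],
    "TSM": [28.3, 26.0, 23.6, 21.2, 18.9, 17.0, 14.0, 11.1, 7.5, 3.5],
    "W-TALC": [37.0, 33.5, 30.4, 25.7, 14.6, 12.7, 10.0, 7.0, 4.2, 1.5],
    "Proposed": [41.9, 38.4, 34.3, 30.8, 27.3, 23.8, 19.7, 15.6, 10.4, 4.7],
}

# Mean over the ten thresholds, rounded to one decimal like the table.
average_map = {name: round(sum(values) / len(values), 1)
               for name, values in rows.items()}
for name, value in average_map.items():
    print(f"{name}: {value}")

# Relative gain at IoU=0.5 over BaS-Net (38.5), the best baseline in that column.
bas_net_at_05 = 38.5
gain_at_05 = round((rows["Proposed"][0] - bas_net_at_05) / bas_net_at_05 * 100.0, 1)
print(gain_at_05)  # 8.8
```

BaS-Net and 3C-Net are left out of the average because their rows have missing entries. Averaging only the reported thresholds would not be comparable.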
