Weakly supervised action localization method with snippet contrastive learning

doi:10.11772/j.issn.1001-9081.2023020246

Journal of Computer Applications, 2024, Vol. 44, Issue (2): 548-555. DOI: 10.11772/j.issn.1001-9081.2023020246

• Multimedia computing and computer simulation •

Weichao DANG, Lei ZHANG, Gaimei GAO, Chunxia LIU

College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan, Shanxi 030024, China

Received: 2023-03-09; Revised: 2023-06-11; Accepted: 2023-06-15; Online: 2023-08-14; Published: 2024-02-10
Contact: Lei ZHANG
About author: DANG Weichao, born in 1974, native of Yuncheng, Shanxi, Ph.D., associate professor, CCF member. His research interests include intelligent computing and software reliability.
GAO Gaimei, born in 1978, native of Lüliang, Shanxi, Ph.D., associate professor, CCF member. Her research interests include network security and cryptography.
LIU Chunxia, born in 1977, native of Datong, Shanxi, M.S., associate professor, CCF member. Her research interests include software engineering and databases.
Supported by:
Doctoral Research Start-up Fund of Taiyuan University of Science and Technology (20202063); Graduate Education Innovation Project of Taiyuan University of Science and Technology (SY2022063)

Abstract:

To address the misclassification of snippets at action boundaries by existing attention-based methods, a weakly supervised action localization method integrating snippet contrastive learning was proposed. First, an attention mechanism with three branches was introduced to measure the likelihood of each video frame being an action instance, context, or background. Second, the Class Activation Sequence (CAS) of each branch was constructed from the obtained attention values. Then, positive and negative sample pairs were generated by a snippet mining algorithm. Finally, snippet contrastive learning was used to guide the network to classify hard (ambiguous) snippets correctly. Experimental results showed that, at an Intersection over Union (IoU) threshold of 0.5, the mean Average Precision (mAP) of the proposed method reached 33.9% and 40.1% on the public THUMOS14 and ActivityNet1.3 datasets, respectively. These values are 1.1 and 2.9 percentage points higher than those of DGCNN (Dynamic Graph modeling for weakly-supervised temporal action localization Convolutional Neural Network), a weakly supervised action localization model, which validates the effectiveness of the proposed method.
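The first two steps of the abstract can be illustrated with a minimal sketch. It assumes that the three attention values of a snippet are obtained by a softmax over three branch scores and that each branch CAS is the base CAS scaled by that branch's attention; neither detail is stated on this page, and the names `three_branch_attention` and `branch_cas` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def three_branch_attention(logits):
    """Map per-snippet [instance, context, background] scores to
    attention weights that sum to 1 (assumed softmax normalisation)."""
    return [softmax(row) for row in logits]

def branch_cas(base_cas, attention, branch):
    """Scale a base class activation sequence (T snippets x C classes)
    by the attention of one branch (0=instance, 1=context, 2=background)."""
    return [[attention[t][branch] * score for score in base_cas[t]]
            for t in range(len(base_cas))]

# Hypothetical toy input: 3 snippets, 2 action classes.
logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.2], [-1.0, 0.0, 2.5]]
base_cas = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.1]]
att = three_branch_attention(logits)
cas_instance = branch_cas(base_cas, att, branch=0)
```

In this toy input the first snippet is dominated by the instance branch and the last by the background branch, so the instance-branch CAS keeps most of the first snippet's activation and suppresses the last.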

Key words: weakly-supervised, contrastive learning, temporal action localization, attention mechanism, class activation sequence

CLC Number: TP391.4

Weichao DANG, Lei ZHANG, Gaimei GAO, Chunxia LIU. Weakly supervised action localization method with snippet contrastive learning[J]. Journal of Computer Applications, 2024, 44(2): 548-555.

Figures/Tables 8

Fig. 1 Principle of snippet contrastive learning

Fig. 2 Overall framework of proposed model

Fig. 3 Hard snippet mining algorithm
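The principle named in Fig. 1 can be sketched as an InfoNCE-style loss, a common form of contrastive objective. This is an assumption made for illustration: the page does not give the loss formula, and the function names, the cosine similarity, and the temperature value 0.07 are all hypothetical choices.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one hard snippet (query): small when the query is
    close to its positive and far from every negative."""
    pos = math.exp(cosine(query, positive) / temperature)
    neg = sum(math.exp(cosine(query, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Hypothetical 2-D snippet features.
hard_snippet = [1.0, 0.0]
loss_good = info_nce(hard_snippet, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_bad = info_nce(hard_snippet, [0.0, 1.0], [[0.9, 0.1]])
```

Minimising such a loss pulls a boundary snippet towards snippets of the same kind (action or background) and pushes it away from the other kind, which is the behaviour the abstract attributes to snippet contrastive learning.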

Tab. 1 Detection results of different weakly-supervised action localization methods on THUMOS14 dataset

Method	mAP@0.1/%	mAP@0.2/%	mAP@0.3/%	mAP@0.4/%	mAP@0.5/%	mAP@0.6/%	mAP@0.7/%
STPN [4]	52.0	44.7	35.5	25.8	16.9	9.9	4.3
W-TALC [29]	55.2	49.6	40.1	31.1	22.8	—	7.6
MAAN [19]	59.8	50.8	41.1	30.6	20.3	12.0	6.9
BasNet [30]	58.2	52.3	44.6	36.0	27.0	18.6	10.4
DGAM [9]	60.0	54.2	46.8	38.2	28.8	19.8	11.4
A2CL-PT [18]	61.2	56.1	48.1	39.0	30.1	19.2	10.6
TSCN [31]	63.4	57.6	47.8	37.7	28.7	19.4	10.2
MSA-Net [32]	65.6	60.7	52.3	41.6	29.7	20.6	10.1
HAM-Net [33]	65.4	59.0	50.3	41.1	31.0	20.7	11.1
EGA-Net [34]	64.5	58.4	50.0	41.4	31.5	21.0	10.7
ACS-Net [35]	—	—	51.4	42.7	32.4	22.0	11.7
DGCNN [36]	66.3	59.9	52.3	43.2	32.8	22.1	13.1
Proposed model	67.7	62.3	53.5	43.3	33.9	22.1	11.1
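The IoU thresholds in the table headers refer to temporal overlap between a predicted action segment and a ground-truth segment. The sketch below shows how that overlap is usually computed and thresholded; the function names and the (start, end) tuple layout in seconds are assumptions for illustration, and the full mAP computation (ranking by confidence, averaging over classes) is omitted.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold=0.5):
    """A prediction matches a ground-truth segment when the temporal IoU
    reaches the threshold (0.5 for the mAP@0.5 column)."""
    return temporal_iou(pred, gt) >= threshold

# Hypothetical segments: the prediction covers 8 s of a 10 s action.
match = is_true_positive((2.0, 10.0), (0.0, 10.0))
miss = is_true_positive((0.0, 10.0), (5.0, 15.0))
```

Higher thresholds demand tighter boundaries, which is why every method's mAP falls from the 0.1 column to the 0.7 column.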

Tab. 2 Detection results of different models on ActivityNet1.3 dataset

Model	mAP@0.5/%	mAP@0.75/%	mAP@0.95/%
STPN [4]	29.3	16.9	2.6
MAAN [19]	33.7	21.9	5.5
TSM [38]	30.3	19.0	4.5
BasNet [30]	34.5	22.5	4.9
EGA-Net [34]	35.4	22.5	4.5
A2CL-PT [18]	36.8	22.5	5.2
BMUE [39]	37.0	23.9	5.7
DGCNN [36]	37.2	23.8	5.8
Proposed model	40.1	24.0	6.0

Tab. 3 Performance comparison of different balance factors on THUMOS14 dataset

Balance factor $\lambda_2$	mAP@0.5/%
0.005	32.5
0.007	33.1
0.010	33.9
0.030	33.6
0.050	33.8
0.070	33.8
0.100	33.6

Tab. 4 Ablation experiment results of action context branch

Experiment	$L_{cls}^{ins}$	$L_{cls}^{con}$	$L_{cls}^{bak}$	$L_{gui}$	$L_s$	mAP@0.1/%	mAP@0.3/%	mAP@0.5/%	mAP@0.7/%
1	√	×	×	×	×	49.9	32.9	16.6	5.3
2	√	×	√	×	×	55.9	41.9	23.0	7.1
3	√	√	×	×	×	67.4	50.8	31.5	10.8
4	√	√	√	×	×	65.6	49.4	29.6	10.0
5	√	√	√	√	√	67.7	53.5	33.9	11.1

Tab. 5 Ablation experiment results of attention guided loss and snippet contrast loss

Experiment	$L_{cls}$	$L_{gui}$	$L_s$	mAP@0.1/%	mAP@0.3/%	mAP@0.5/%	mAP@0.7/%
1	√	×	×	65.6	49.4	29.6	10.0
2	√	×	√	65.5	49.7	29.8	10.1
3	√	√	×	66.4	51.5	32.2	11.0
4	√	√	√	67.7	53.5	33.9	11.1

References 39

1	SUN C, SHETTY S, SUKTHANKAR R, et al. Temporal localization of fine-grained actions in videos by domain transfer from web images [C]// Proceedings of the 23rd ACM International Conference on Multimedia. New York: ACM, 2015: 371-380. DOI: 10.1145/2733373.2806226
2	HU C, HUA G. Weakly supervised action localization method based on attention mechanism [J]. Journal of Computer Applications, 2022, 42(3): 960-967. (in Chinese)
3	GUO W B, YANG X M, JIANG Z Y, et al. Multi-temporal scales consensus for weakly supervised temporal action localization [J]. Computer Engineering and Applications, 2023, 59(10): 151-161. DOI: 10.3778/j.issn.1002-8331.2201-0233 (in Chinese)
4	NGUYEN P, HAN B, LIU T, et al. Weakly supervised action localization by sparse temporal pooling network [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6752-6761. DOI: 10.1109/cvpr.2018.00706
5	ZENG R, GAN C, CHEN P, et al. Breaking Winner-Takes-All: Iterative-Winners-Out networks for weakly supervised temporal action localization [J]. IEEE Transactions on Image Processing, 2019, 28(12): 5797-5808. DOI: 10.1109/tip.2019.2922108
6	SHOU Z, GAO H, ZHANG L, et al. AutoLoc: weakly-supervised temporal action localization in untrimmed videos [C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 162-179. DOI: 10.1007/978-3-030-01270-0_10
7	CHEN M, FANG Y, WANG X, et al. Diversity transfer network for few-shot learning [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34: 10559-10566. DOI: 10.1609/aaai.v34i07.6628
8	ZHUANG C, ZHAI A, YAMINS D. Local aggregation for unsupervised learning of visual embeddings [C]// Proceedings of the 2019 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2019: 6001-6011. DOI: 10.1109/iccv.2019.00610
9	SHI B, DAI Q, MU Y, et al. Weakly-supervised action localization by generative attention modeling [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 1006-1016. DOI: 10.1109/cvpr42600.2020.00109
10	ZHANG C, CAO M, YANG D, et al. CoLA: weakly-supervised temporal action localization with snippet contrastive learning [C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 16010-16019. DOI: 10.1109/cvpr46437.2021.01575
11	SHOU Z, WANG D, CHANG S-F. Temporal action localization in untrimmed videos via multi-stage CNNs [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 1049-1058. DOI: 10.1109/cvpr.2016.119
12	ZHAO Y, XIONG Y, WANG L, et al. Temporal action detection with structured segment networks [C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 2933-2942. DOI: 10.1109/iccv.2017.317
13	XU H, DAS A, SAENKO K. R-C3D: region convolutional 3D network for temporal activity detection [C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5794-5803. DOI: 10.1109/iccv.2017.617
14	LIN T, ZHAO X, SU H, et al. BSN: boundary sensitive network for temporal action proposal generation [C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 3-21. DOI: 10.1007/978-3-030-01225-0_1
15	LIN T, ZHAO X, SHOU Z. Single shot temporal action detection [C]// Proceedings of the 25th ACM International Conference on Multimedia. New York: ACM, 2017: 988-996. DOI: 10.1145/3123266.3123343
16	WANG L, XIONG Y, LIN D, et al. UntrimmedNets for weakly supervised action recognition and detection [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6402-6411. DOI: 10.1109/cvpr.2017.678
17	NARAYAN S, CHOLAKKAL H, KHAN F S, et al. 3C-Net: category count and center loss for weakly-supervised action localization [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 8679-8687. DOI: 10.1109/iccv.2019.00877
18	MIN K, CORSO J J. Adversarial background-aware loss for weakly-supervised temporal activity localization [C]// Proceedings of the 2020 European Conference on Computer Vision. Cham: Springer, 2020: 283-299. DOI: 10.1007/978-3-030-58568-6_17
19	YUAN Y, LYU Y, SHEN X, et al. Marginalized average attentional network for weakly-supervised learning [EB/OL]. [2023-03-09].
20	LI X, LIU X P, LI W C, et al. Survey on contrastive learning research [J]. Journal of Chinese Computer Systems, 2023, 44(4): 787-797. (in Chinese)
21	HE K, FAN H, WU Y, et al. Momentum contrast for unsupervised visual representation learning [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 9726-9735. DOI: 10.1109/cvpr42600.2020.00975
22	CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations [EB/OL]. [2023-03-09].
23	GUTMANN M, HYVÄRINEN A. Noise-contrastive estimation: a new estimation principle for unnormalized statistical models [C]// Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. New York: JMLR.org, 2010: 297-304.
24	ZACH C, POCK T, BISCHOF H, et al. A duality based approach for realtime TV-L¹ optical flow [C]// Proceedings of the 29th DAGM Conference on Pattern Recognition. Berlin: Springer, 2007: 214-223. DOI: 10.1007/978-3-540-74936-3_22
25	KAY W, CARREIRA J, SIMONYAN K, et al. The Kinetics human action video dataset [EB/OL]. [2023-03-09].
26	CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 4724-4733. DOI: 10.1109/cvpr.2017.502
27	IDREES H, ZAMIR A R, JIANG Y-G, et al. The THUMOS challenge on action recognition for videos "in the wild" [J]. Computer Vision and Image Understanding, 2017, 155: 1-23. DOI: 10.1016/j.cviu.2016.10.018
28	HEILBRON F C, ESCORCIA V, GHANEM B, et al. ActivityNet: a large-scale video benchmark for human activity understanding [C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 961-970. DOI: 10.1109/cvpr.2015.7298698
29	PAUL S, ROY S, ROY-CHOWDHURY A K. W-TALC: weakly-supervised temporal activity localization and classification [C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 588-607. DOI: 10.1007/978-3-030-01225-0_35
30	LEE P, UH Y, BYUN H. Background suppression network for weakly-supervised temporal action localization [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11320-11327. DOI: 10.1609/aaai.v34i07.6793
31	ZHAI Y, WANG L, TANG W, et al. Two-stream consensus network for weakly-supervised temporal action localization [C]// Proceedings of the 2020 European Conference on Computer Vision. Cham: Springer, 2020: 37-54. DOI: 10.1007/978-3-030-58539-6_3
32	YANG W, ZHANG T, MAO Z, et al. Multi-scale structure-aware network for weakly supervised temporal action detection [J]. IEEE Transactions on Image Processing, 2021, 30: 5848-5861. DOI: 10.1109/tip.2021.3089361
33	ISLAM A, LONG C, RADKE R. A hybrid attention mechanism for weakly-supervised temporal action localization [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(2): 1637-1645. DOI: 10.1609/aaai.v35i2.16256
34	CHENG Y, SUN Y, FAN H, et al. Entropy guided attention network for weakly-supervised action localization [J]. Pattern Recognition, 2022, 129: 108718. DOI: 10.1016/j.patcog.2022.108718
35	LIU Z, WANG L, ZHANG Q, et al. ACSNet: action-context separation network for weakly supervised temporal action localization [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(3): 2233-2241. DOI: 10.1609/aaai.v35i3.16322
36	SHI H, ZHANG X-Y, LI C, et al. Dynamic graph modeling for weakly-supervised temporal action localization [C]// Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 3820-3828. DOI: 10.1145/3503161.3548077
37	KINGMA D P, BA J. Adam: a method for stochastic optimization [EB/OL]. [2023-03-09].
38	YU T, REN Z, LI Y, et al. Temporal structure mining for weakly supervised action detection [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 5521-5530. DOI: 10.1109/iccv.2019.00562
39	LEE P, WANG J, LU Y, et al. Background modeling via uncertainty estimation for weakly-supervised action localization [EB/OL]. [2023-03-09].
