Weakly supervised action localization method with snippet contrastive learning

Journal of Computer Applications, 2024, Vol. 44, Issue (2): 548-555. DOI: 10.11772/j.issn.1001-9081.2023020246

• Multimedia computing and computer simulation •

Weichao DANG, Lei ZHANG, Gaimei GAO, Chunxia LIU

College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan Shanxi 030024, China

Received: 2023-03-09 Revised: 2023-06-11 Accepted: 2023-06-15 Online: 2023-08-14 Published: 2024-02-10
Contact: Lei ZHANG
About author: DANG Weichao, born in 1974 in Yuncheng, Shanxi, Ph. D., associate professor, CCF member. His research interests include intelligent computing and software reliability.
GAO Gaimei, born in 1978 in Lüliang, Shanxi, Ph. D., associate professor, CCF member. Her research interests include network security and cryptography.
LIU Chunxia, born in 1977 in Datong, Shanxi, M. S., associate professor, CCF member. Her research interests include software engineering and databases.
Supported by:
Doctoral Research Start-up Fund of Taiyuan University of Science and Technology (20202063); Graduate Education Innovation Project of Taiyuan University of Science and Technology (SY2022063)

Abstract

Abstract:

To address the tendency of existing attention-based weakly supervised action localization methods to misclassify snippets at action boundaries, a weakly supervised action localization method that integrates snippet contrastive learning was proposed. First, an attention mechanism with three branches was introduced to measure the likelihood that each video frame belongs to an action instance, its context, or the background. Second, the Class Activation Sequence (CAS) of each branch was constructed from the obtained attention values. Then, positive and negative sample pairs were constructed by a snippet mining algorithm. Finally, snippet contrastive learning was used to guide the network to classify the ambiguous (hard) snippets correctly. Experimental results showed that at an Intersection over Union (IoU) threshold of 0.5, the mean Average Precision (mAP) of the proposed method reached 33.9% on THUMOS14 and 40.1% on ActivityNet1.3, improvements of 1.1 and 2.9 percentage points respectively over the weakly supervised action localization model DGCNN (Dynamic Graph modeling for weakly-supervised temporal action localization Convolutional Neural Network), which validates the effectiveness of the proposed method.

Key words: weakly-supervised, contrastive learning, temporal action localization, attention mechanism, class activation sequence
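
The pipeline summarized in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the softmax normalization across the three branches, the cosine-similarity InfoNCE form, the temperature value, and the names `branch_cas` and `info_nce` are all assumptions made for illustration. It only shows how per-snippet attention might reweight a class activation sequence and how a contrastive loss might pull a hard snippet towards a positive sample and away from negatives.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def branch_cas(cas, attention):
    """Weight a (T, C) class activation sequence by (T, 3) attention.

    Assumption: the columns of `attention` are the instance, context and
    background weights of each snippet, in that order.
    """
    names = ("instance", "context", "background")
    return {name: cas * attention[:, i:i + 1] for i, name in enumerate(names)}

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one hard snippet (hypothetical helper).

    query: (D,) embedding of the hard snippet
    positive: (D,) embedding of a snippet from the same category
    negatives: (K, D) embeddings of snippets from the other category
    """
    def unit(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)

    q, p, n = unit(query), unit(positive), unit(negatives)
    # Cosine similarities: the positive pair first, then the K negatives.
    logits = np.concatenate(([q @ p], n @ q)) / temperature
    logits = logits - logits.max()
    # Negative log-probability assigned to the positive pair.
    return -(logits[0] - np.log(np.exp(logits).sum()))

# Toy usage with random features: 8 snippets, 4 classes, 16-dim embeddings.
rng = np.random.default_rng(0)
attention = softmax(rng.normal(size=(8, 3)), axis=1)
cas_per_branch = branch_cas(rng.normal(size=(8, 4)), attention)
loss = info_nce(rng.normal(size=16), rng.normal(size=16), rng.normal(size=(5, 16)))
```

Because the assumed attention weights sum to one for each snippet, the three branch sequences add back up to the original sequence. A real model could just as well use independent sigmoid attentions, in which case this property would not hold.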

CLC number: TP391.4

Weichao DANG, Lei ZHANG, Gaimei GAO, Chunxia LIU. Weakly supervised action localization method with snippet contrastive learning[J]. Journal of Computer Applications, 2024, 44(2): 548-555.

Figures/Tables (8)

Fig. 1 Principle of snippet contrastive learning

Fig. 2 Overall framework of proposed model

Fig. 3 Hard snippet mining algorithm

Tab. 1 Detection results of different weakly-supervised action localization methods on THUMOS14 dataset (mAP/% at each IoU threshold)

Method	mAP@0.1	mAP@0.2	mAP@0.3	mAP@0.4	mAP@0.5	mAP@0.6	mAP@0.7
STPN [4]	52.0	44.7	35.5	25.8	16.9	9.9	4.3
W-TALC [29]	55.2	49.6	40.1	31.1	22.8	-	7.6
MAAN [19]	59.8	50.8	41.1	30.6	20.3	12.0	6.9
BasNet [30]	58.2	52.3	44.6	36.0	27.0	18.6	10.4
DGAM [9]	60.0	54.2	46.8	38.2	28.8	19.8	11.4
A2CL-PT [18]	61.2	56.1	48.1	39.0	30.1	19.2	10.6
TSCN [31]	63.4	57.6	47.8	37.7	28.7	19.4	10.2
MSA-Net [32]	65.6	60.7	52.3	41.6	29.7	20.6	10.1
HAM-Net [33]	65.4	59.0	50.3	41.1	31.0	20.7	11.1
EGA-Net [34]	64.5	58.4	50.0	41.4	31.5	21.0	10.7
ACS-Net [35]	-	-	51.4	42.7	32.4	22.0	11.7
DGCNN [36]	66.3	59.9	52.3	43.2	32.8	22.1	13.1
Proposed model	67.7	62.3	53.5	43.3	33.9	22.1	11.1
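
Each column of the table counts a predicted action segment as correct only when its temporal IoU with a ground-truth segment reaches the column's threshold. The sketch below illustrates that criterion for one predicted and one ground-truth interval. The function names and the example intervals are hypothetical, and the full mAP protocol (confidence ranking, one-to-one matching, averaging over classes) is omitted.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) intervals given in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold=0.5):
    """A prediction matches a ground-truth segment if IoU >= threshold."""
    return temporal_iou(pred, gt) >= threshold

# A prediction that covers the action but starts one second early.
print(temporal_iou((3.0, 8.0), (4.0, 8.0)))   # 4 s overlap / 5 s union
# A prediction that overlaps only half of the action.
print(is_true_positive((2.0, 6.0), (4.0, 8.0), threshold=0.5))
```

The second prediction overlaps the action for 2 s out of a 6 s union, so it would count at low thresholds such as 0.1 or 0.3 but not at 0.5, which is why mAP falls as the threshold rises across the columns.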

Tab. 2 Detection results of different models on ActivityNet1.3 dataset (mAP/% at each IoU threshold)

Model	mAP@0.5	mAP@0.75	mAP@0.95
STPN [4]	29.3	16.9	2.6
MAAN [19]	33.7	21.9	5.5
TSM [38]	30.3	19.0	4.5
BasNet [30]	34.5	22.5	4.9
EGA-Net [34]	35.4	22.5	4.5
A2CL-PT [18]	36.8	22.5	5.2
BMUE [39]	37.0	23.9	5.7
DGCNN [36]	37.2	23.8	5.8
Proposed model	40.1	24.0	6.0
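
The gains over DGCNN quoted in the abstract are differences in percentage points, not relative improvements. The short check below recomputes them from the mAP@0.5 entries of Tab. 1 and Tab. 2. The variable names are illustrative only.

```python
# mAP@0.5 (%) taken from Tab. 1 (THUMOS14) and Tab. 2 (ActivityNet1.3).
proposed = {"THUMOS14": 33.9, "ActivityNet1.3": 40.1}
dgcnn = {"THUMOS14": 32.8, "ActivityNet1.3": 37.2}

# Absolute gain in percentage points, rounded to the tables' precision.
gains = {name: round(proposed[name] - dgcnn[name], 1) for name in proposed}
print(gains)  # {'THUMOS14': 1.1, 'ActivityNet1.3': 2.9}

# Relative gain in percent, shown only to contrast with percentage points.
relative = {name: 100.0 * gains[name] / dgcnn[name] for name in proposed}
print(relative)
```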

Tab. 3 Performance comparison of different balance factors on THUMOS14 dataset

Balance factor $\lambda_2$	mAP@0.5/%
0.005	32.5
0.007	33.1
0.010	33.9
0.030	33.6
0.050	33.8
0.070	33.8
0.100	33.6
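
The sweep in Tab. 3 can be read programmatically to pick the best balance factor and to see how sensitive the result is. The sketch below uses only the values listed in the table. The dictionary layout is an illustrative choice, and no assumption is made about which loss term $\lambda_2$ weights.

```python
# Balance factor lambda_2 -> mAP@0.5 (%) on THUMOS14, copied from Tab. 3.
sweep = {
    0.005: 32.5,
    0.007: 33.1,
    0.010: 33.9,
    0.030: 33.6,
    0.050: 33.8,
    0.070: 33.8,
    0.100: 33.6,
}

best_lambda2 = max(sweep, key=sweep.get)
spread = round(max(sweep.values()) - min(sweep.values()), 1)
print(best_lambda2, spread)  # 0.01 1.4
```

Across the whole sweep the mAP@0.5 varies by 1.4 percentage points, and every setting from 0.010 upward stays within 0.3 points of the best value.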

Tab. 4 Ablation experiment results of action context branch (mAP/% at each IoU threshold)

Experiment	$L_{cls}^{ins}$	$L_{cls}^{con}$	$L_{cls}^{bak}$	$L_{gui}$	$L_{s}$	mAP@0.1	mAP@0.3	mAP@0.5	mAP@0.7
1	√	×	×	×	×	49.9	32.9	16.6	5.3
2	√	×	√	×	×	55.9	41.9	23.0	7.1
3	√	√	×	×	×	67.4	50.8	31.5	10.8
4	√	√	√	×	×	65.6	49.4	29.6	10.0
5	√	√	√	√	√	67.7	53.5	33.9	11.1

Tab. 5 Ablation experiment results of attention guided loss and snippet contrast loss (mAP/% at each IoU threshold)

Experiment	$L_{cls}$	$L_{gui}$	$L_{s}$	mAP@0.1	mAP@0.3	mAP@0.5	mAP@0.7
1	√	×	×	65.6	49.4	29.6	10.0
2	√	×	√	65.5	49.7	29.8	10.1
3	√	√	×	66.4	51.5	32.2	11.0
4	√	√	√	67.7	53.5	33.9	11.1

References (39)

1	SUN C, SHETTY S, SUKTHANKAR R, et al. Temporal localization of fine-grained actions in videos by domain transfer from web images[C]// Proceedings of the 23rd ACM International Conference on Multimedia. New York: ACM, 2015: 371-380. DOI: 10.1145/2733373.2806226
2	HU C, HUA G. Weakly supervised action localization method based on attention mechanism[J]. Journal of Computer Applications, 2022, 42(3): 960-967. (in Chinese)
3	GUO W B, YANG X M, JIANG Z Y, et al. Multi-temporal scales consensus for weakly supervised temporal action localization[J]. Computer Engineering and Applications, 2023, 59(10): 151-161. (in Chinese) DOI: 10.3778/j.issn.1002-8331.2201-0233
4	NGUYEN P, HAN B, LIU T, et al. Weakly supervised action localization by sparse temporal pooling network[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6752-6761. DOI: 10.1109/cvpr.2018.00706
5	ZENG R, GAN C, CHEN P, et al. Breaking Winner-Takes-All: Iterative-Winners-Out networks for weakly supervised temporal action localization[J]. IEEE Transactions on Image Processing, 2019, 28(12): 5797-5808. DOI: 10.1109/tip.2019.2922108
6	SHOU Z, GAO H, ZHANG L, et al. AutoLoc: weakly-supervised temporal action localization in untrimmed videos[C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 162-179. DOI: 10.1007/978-3-030-01270-0_10
7	CHEN M, FANG Y, WANG X, et al. Diversity transfer network for Few-Shot learning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34: 10559-10566. DOI: 10.1609/aaai.v34i07.6628
8	ZHUANG C, ZHAI A, YAMINS D. Local aggregation for unsupervised learning of visual embeddings[C]// Proceedings of the 2019 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2019: 6001-6011. DOI: 10.1109/iccv.2019.00610
9	SHI B, DAI Q, MU Y, et al. Weakly-supervised action localization by generative attention modeling[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 1006-1016. DOI: 10.1109/cvpr42600.2020.00109
10	ZHANG C, CAO M, YANG D, et al. CoLA: weakly-supervised temporal action localization with snippet contrastive learning[C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 16010-16019. DOI: 10.1109/cvpr46437.2021.01575
11	SHOU Z, WANG D, CHANG S-F. Temporal action localization in untrimmed videos via multi-stage CNNs[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 1049-1058. DOI: 10.1109/cvpr.2016.119
12	ZHAO Y, XIONG Y, WANG L, et al. Temporal action detection with structured segment networks[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 2933-2942. DOI: 10.1109/iccv.2017.317
13	XU H, DAS A, SAENKO K. R-C3D: region convolutional 3D network for temporal activity detection[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5794-5803. DOI: 10.1109/iccv.2017.617
14	LIN T, ZHAO X, SU H, et al. BSN: boundary sensitive network for temporal action proposal generation[C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 3-21. DOI: 10.1007/978-3-030-01225-0_1
15	LIN T, ZHAO X, SHOU Z. Single shot temporal action detection[C]// Proceedings of the 25th ACM International Conference on Multimedia. New York: ACM, 2017: 988-996. DOI: 10.1145/3123266.3123343
16	WANG L, XIONG Y, LIN D, et al. UntrimmedNets for weakly supervised action recognition and detection[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6402-6411. DOI: 10.1109/cvpr.2017.678
17	NARAYAN S, CHOLAKKAL H, KHAN F S, et al. 3C-Net: category count and center loss for weakly-supervised action localization[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 8679-8687. DOI: 10.1109/iccv.2019.00877
18	MIN K, CORSO J J. Adversarial background-aware loss for weakly-supervised temporal activity localization[C]// Proceedings of the 2020 European Conference on Computer Vision. Cham: Springer, 2020: 283-299. DOI: 10.1007/978-3-030-58568-6_17
19	YUAN Y, LYU Y, SHEN X, et al. Marginalized average attentional network for weakly-supervised learning[EB/OL]. [2023-03-09].
20	LI X, LIU X P, LI W C, et al. Survey on contrastive learning research[J]. Journal of Chinese Computer Systems, 2023, 44(4): 787-797. (in Chinese)
21	HE K, FAN H, WU Y, et al. Momentum contrast for unsupervised visual representation learning[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 9726-9735. DOI: 10.1109/cvpr42600.2020.00975
22	CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations[EB/OL]. [2023-03-09].
23	GUTMANN M, HYVÄRINEN A. Noise-contrastive estimation: a new estimation principle for unnormalized statistical models[C]// Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. New York: JMLR.org, 2010: 297-304.
24	ZACH C, POCK T, BISCHOF H, et al. A duality based approach for realtime TV-L1 optical flow[C]// Proceedings of the 29th DAGM Conference on Pattern Recognition. Berlin: Springer, 2007: 214-223. DOI: 10.1007/978-3-540-74936-3_22
25	KAY W, CARREIRA J, SIMONYAN K, et al. The Kinetics human action video dataset[EB/OL]. [2023-03-09].
26	CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 4724-4733. DOI: 10.1109/cvpr.2017.502
27	IDREES H, ZAMIR A R, JIANG Y-G, et al. The THUMOS challenge on action recognition for videos "in the wild"[J]. Computer Vision and Image Understanding, 2017, 155: 1-23. DOI: 10.1016/j.cviu.2016.10.018
28	HEILBRON F C, ESCORCIA V, GHANEM B, et al. ActivityNet: a large-scale video benchmark for human activity understanding[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 961-970. DOI: 10.1109/cvpr.2015.7298698
29	PAUL S, ROY S, ROY-CHOWDHURY A K. W-TALC: weakly-supervised temporal activity localization and classification[C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 588-607. DOI: 10.1007/978-3-030-01225-0_35
30	LEE P, UH Y, BYUN H. Background suppression network for weakly-supervised temporal action localization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11320-11327. DOI: 10.1609/aaai.v34i07.6793
31	ZHAI Y, WANG L, TANG W, et al. Two-stream consensus network for weakly-supervised temporal action localization[C]// Proceedings of the 2020 European Conference on Computer Vision. Cham: Springer, 2020: 37-54. DOI: 10.1007/978-3-030-58539-6_3
32	YANG W, ZHANG T, MAO Z, et al. Multi-scale structure-aware network for weakly supervised temporal action detection[J]. IEEE Transactions on Image Processing, 2021, 30: 5848-5861. DOI: 10.1109/tip.2021.3089361
33	ISLAM A, LONG C, RADKE R. A hybrid attention mechanism for weakly-supervised temporal action localization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(2): 1637-1645. DOI: 10.1609/aaai.v35i2.16256
34	CHENG Y, SUN Y, FAN H, et al. Entropy guided attention network for weakly-supervised action localization[J]. Pattern Recognition, 2022, 129: 108718. DOI: 10.1016/j.patcog.2022.108718
35	LIU Z, WANG L, ZHANG Q, et al. ACSNet: action-context separation network for weakly supervised temporal action localization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(3): 2233-2241. DOI: 10.1609/aaai.v35i3.16322
36	SHI H, ZHANG X-Y, LI C, et al. Dynamic graph modeling for weakly-supervised temporal action localization[C]// Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 3820-3828. DOI: 10.1145/3503161.3548077
37	KINGMA D P, BA J. Adam: a method for stochastic optimization[EB/OL]. [2023-03-09].
38	YU T, REN Z, LI Y, et al. Temporal structure mining for weakly supervised action detection[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 5521-5530. DOI: 10.1109/iccv.2019.00562
39	LEE P, WANG J, LU Y, et al. Background modeling via uncertainty estimation for weakly-supervised action localization[EB/OL]. [2023-03-09].
