Weakly supervised action localization method with snippet contrastive learning

Journal of Computer Applications, 2024, Vol. 44, Issue 2: 548-555. DOI: 10.11772/j.issn.1001-9081.2023020246

Special topic: Multimedia Computing and Computer Simulation

Weichao DANG, Lei ZHANG, Gaimei GAO, Chunxia LIU

College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan Shanxi 030024, China

Received: 2023-03-09 Revised: 2023-06-11 Accepted: 2023-06-15 Online: 2023-08-14 Published: 2024-02-10
Contact: Lei ZHANG
About author: DANG Weichao, born in 1974 in Yuncheng, Shanxi, Ph. D., associate professor, CCF member. His research interests include intelligent computing and software reliability.
GAO Gaimei, born in 1978 in Lüliang, Shanxi, Ph. D., associate professor, CCF member. Her research interests include network security and cryptography.
LIU Chunxia, born in 1977 in Datong, Shanxi, M. S., associate professor, CCF member. Her research interests include software engineering and databases.
Supported by:
Doctoral Research Start-up Fund of Taiyuan University of Science and Technology (20202063); Graduate Education Innovation Project of Taiyuan University of Science and Technology (SY2022063)

Abstract

To address the tendency of existing attention-based weakly supervised action localization methods to misclassify snippets at action boundaries, a weakly supervised action localization method integrating snippet contrastive learning was proposed. First, an attention mechanism with three branches was introduced to measure the likelihood that each video frame belongs to an action instance, context, or background. Second, the Class Activation Sequence (CAS) of each branch was constructed from the obtained attention values. Then, positive and negative sample pairs were generated by a snippet mining algorithm. Finally, snippet contrastive learning was used to guide the network to classify hard snippets correctly. Experimental results show that, at an Intersection over Union (IoU) threshold of 0.5, the proposed method achieves a mean Average Precision (mAP) of 33.9% on THUMOS14 and 40.1% on ActivityNet1.3, which is 1.1 and 2.9 percentage points higher, respectively, than the DGCNN (Dynamic Graph modeling for weakly-supervised temporal action localization Convolutional Neural Network) weakly supervised action localization model, validating the effectiveness of the proposed method.

Key words: weakly-supervised, contrastive learning, temporal action localization, attention mechanism, class activation sequence
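
The pipeline summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the three attention values of a snippet are softmax-normalized, each branch's CAS is the class scores scaled by that branch's attention, and the contrastive term is a generic InfoNCE-style loss with a hypothetical `temperature` parameter. The function names are also hypothetical and are not taken from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def branch_cas(snippet_scores, attention_logits):
    """Build one class activation sequence (CAS) per attention branch.

    snippet_scores:   T x C class scores, one row per snippet.
    attention_logits: T x 3 logits ordered (action, context, background).
    Assumption: attention is softmax-normalized per snippet, and each
    branch's CAS is the class scores scaled by that branch's attention.
    """
    attention = [softmax(row) for row in attention_logits]
    return [
        [[att[branch] * s for s in scores]
         for att, scores in zip(attention, snippet_scores)]
        for branch in range(3)
    ]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def snippet_contrastive_loss(query, positive, negatives, temperature=0.07):
    """InfoNCE-style loss (an assumed form, not the paper's exact loss).

    Pulls the query snippet embedding toward its positive and pushes it
    away from the negatives, using cosine similarity.
    """
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive)) / temperature]
    logits += [dot(q, l2_normalize(n)) / temperature for n in negatives]
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]
```

In this sketch a hard snippet near an action boundary would serve as `query`, with the mined positive and negative snippets supplying `positive` and `negatives`; the loss is small when the query already sits close to its positive and grows when it sits closer to a negative.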

CLC number:

TP391.4

Weichao DANG, Lei ZHANG, Gaimei GAO, Chunxia LIU. Weakly supervised action localization method with snippet contrastive learning[J]. Journal of Computer Applications, 2024, 44(2): 548-555.

Figures and tables (8)

Fig. 1 Principle of snippet contrastive learning

Fig. 2 Overall framework of proposed model

Fig. 3 Hard snippet mining algorithm

Tab. 1 Detection results (mAP, %) of different weakly-supervised action localization methods on THUMOS14 dataset

Method	mAP@0.1	mAP@0.2	mAP@0.3	mAP@0.4	mAP@0.5	mAP@0.6	mAP@0.7
STPN [4]	52.0	44.7	35.5	25.8	16.9	9.9	4.3
W-TALC [29]	55.2	49.6	40.1	31.1	22.8	-	7.6
MAAN [19]	59.8	50.8	41.1	30.6	20.3	12.0	6.9
BasNet [30]	58.2	52.3	44.6	36.0	27.0	18.6	10.4
DGAM [9]	60.0	54.2	46.8	38.2	28.8	19.8	11.4
A2CL-PT [18]	61.2	56.1	48.1	39.0	30.1	19.2	10.6
TSCN [31]	63.4	57.6	47.8	37.7	28.7	19.4	10.2
MSA-Net [32]	65.6	60.7	52.3	41.6	29.7	20.6	10.1
HAM-Net [33]	65.4	59.0	50.3	41.1	31.0	20.7	11.1
EGA-Net [34]	64.5	58.4	50.0	41.4	31.5	21.0	10.7
ACS-Net [35]	-	-	51.4	42.7	32.4	22.0	11.7
DGCNN [36]	66.3	59.9	52.3	43.2	32.8	22.1	13.1
Proposed model	67.7	62.3	53.5	43.3	33.9	22.1	11.1
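
The column headings in Tab. 1 are IoU thresholds. As a minimal sketch of the standard temporal IoU between a predicted segment and a ground-truth segment (the function names and the `(start, end)` tuple layout are illustrative assumptions, not the paper's evaluation code):

```python
def temporal_iou(pred, gt):
    """Intersection over union of two time segments given as (start, end)."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def meets_threshold(pred, gt, threshold=0.5):
    """True when the prediction overlaps the ground truth by at least `threshold` IoU."""
    return temporal_iou(pred, gt) >= threshold
```

A full mAP evaluation additionally requires the predicted class to match and each ground-truth instance to be matched at most once; the sketch covers only the overlap test, which becomes stricter as the threshold moves from 0.1 to 0.7.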

Tab. 2 Detection results (mAP, %) of different models on ActivityNet1.3 dataset

Model	mAP@0.5	mAP@0.75	mAP@0.95
STPN [4]	29.3	16.9	2.6
MAAN [19]	33.7	21.9	5.5
TSM [38]	30.3	19.0	4.5
BasNet [30]	34.5	22.5	4.9
EGA-Net [34]	35.4	22.5	4.5
A2CL-PT [18]	36.8	22.5	5.2
BMUE [39]	37.0	23.9	5.7
DGCNN [36]	37.2	23.8	5.8
Proposed model	40.1	24.0	6.0
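
The gains over DGCNN quoted in the abstract can be re-derived from the mAP@0.5 entries of Tab. 1 and Tab. 2. The short check below copies those four values from the tables; the dictionary layout and the helper name `gain_over_dgcnn` are illustrative only.

```python
# mAP@0.5 (%) for DGCNN and the proposed model, copied from Tab. 1 and Tab. 2.
MAP_AT_05 = {
    "THUMOS14": {"DGCNN": 32.8, "Proposed": 33.9},
    "ActivityNet1.3": {"DGCNN": 37.2, "Proposed": 40.1},
}


def gain_over_dgcnn(dataset):
    """Improvement of the proposed model over DGCNN, in percentage points."""
    row = MAP_AT_05[dataset]
    # Round to one decimal to absorb floating-point noise in the subtraction.
    return round(row["Proposed"] - row["DGCNN"], 1)
```

The two differences come out to 1.1 and 2.9 percentage points, matching the figures stated in the abstract.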

Tab. 3 Performance comparison of different balance factors on THUMOS14 dataset

Balance factor $\lambda_2$	mAP@0.5/%	Balance factor $\lambda_2$	mAP@0.5/%
0.005	32.5	0.050	33.8
0.007	33.1	0.070	33.8
0.010	33.9	0.100	33.6
0.030	33.6

Tab. 4 Ablation experiment results of action context branch

Experiment	$L_{\mathrm{cls}}^{\mathrm{ins}}$	$L_{\mathrm{cls}}^{\mathrm{con}}$	$L_{\mathrm{cls}}^{\mathrm{bak}}$	$L_{\mathrm{gui}}$	$L_{\mathrm{s}}$	mAP@0.1/%	mAP@0.3/%	mAP@0.5/%	mAP@0.7/%
1	√	×	×	×	×	49.9	32.9	16.6	5.3
2	√	×	√	×	×	55.9	41.9	23.0	7.1
3	√	√	×	×	×	67.4	50.8	31.5	10.8
4	√	√	√	×	×	65.6	49.4	29.6	10.0
5	√	√	√	√	√	67.7	53.5	33.9	11.1

Tab. 5 Ablation experiment results of attention guided loss and snippet contrast loss

Experiment	$L_{\mathrm{cls}}$	$L_{\mathrm{gui}}$	$L_{\mathrm{s}}$	mAP@0.1/%	mAP@0.3/%	mAP@0.5/%	mAP@0.7/%
1	√	×	×	65.6	49.4	29.6	10.0
2	√	×	√	65.5	49.7	29.8	10.1
3	√	√	×	66.4	51.5	32.2	11.0
4	√	√	√	67.7	53.5	33.9	11.1

References (39)

1	SUN C, SHETTY S, SUKTHANKAR R, et al. Temporal localization of fine-grained actions in videos by domain transfer from web images[C]// Proceedings of the 23rd ACM International Conference on Multimedia. New York: ACM, 2015: 371-380. DOI: 10.1145/2733373.2806226
2	HU C, HUA G. Weakly supervised action localization method based on attention mechanism[J]. Journal of Computer Applications, 2022, 42(3): 960-967.
3	GUO W B, YANG X M, JIANG Z Y, et al. Multi-temporal scales consensus for weakly supervised temporal action localization[J]. Computer Engineering and Applications, 2023, 59(10): 151-161. DOI: 10.3778/j.issn.1002-8331.2201-0233
4	NGUYEN P, HAN B, LIU T, et al. Weakly supervised action localization by sparse temporal pooling network[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6752-6761. DOI: 10.1109/cvpr.2018.00706
5	ZENG R, GAN C, CHEN P, et al. Breaking Winner-Takes-All: Iterative-Winners-Out networks for weakly supervised temporal action localization[J]. IEEE Transactions on Image Processing, 2019, 28(12): 5797-5808. DOI: 10.1109/tip.2019.2922108
6	SHOU Z, GAO H, ZHANG L, et al. AutoLoc: weakly-supervised temporal action localization in untrimmed videos[C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 162-179. DOI: 10.1007/978-3-030-01270-0_10
7	CHEN M, FANG Y, WANG X, et al. Diversity transfer network for Few-Shot learning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34: 10559-10566. DOI: 10.1609/aaai.v34i07.6628
8	ZHUANG C, ZHAI A, YAMINS D. Local aggregation for unsupervised learning of visual embeddings[C]// Proceedings of the 2019 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2019: 6001-6011. DOI: 10.1109/iccv.2019.00610
9	SHI B, DAI Q, MU Y, et al. Weakly-supervised action localization by generative attention modeling[C]// Proceedings of the 2020 International Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 1006-1016. DOI: 10.1109/cvpr42600.2020.00109
10	ZHANG C, CAO M, YANG D, et al. CoLA: weakly-supervised temporal action localization with snippet contrastive learning[C]// Proceedings of the 2021 International Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 16010-16019. DOI: 10.1109/cvpr46437.2021.01575
11	SHOU Z, WANG D, CHANG S-F. Temporal action localization in untrimmed videos via multi-stage CNNs[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 1049-1058. DOI: 10.1109/cvpr.2016.119
12	ZHAO Y, XIONG Y, WANG L, et al. Temporal action detection with structured segment networks[C]// Proceedings of the 2017 IEEE Conference on Computer Vision. Piscataway: IEEE, 2017: 2933-2942. DOI: 10.1109/iccv.2017.317
13	XU H, DAS A, SAENKO K. R-C3D: region convolutional 3D network for temporal activity detection[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5794-5803. DOI: 10.1109/iccv.2017.617
14	LIN T, ZHAO X, SU H, et al. BSN: boundary sensitive network for temporal action proposal generation[C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 3-21. DOI: 10.1007/978-3-030-01225-0_1
15	LIN T, ZHAO X, SHOU Z. Single shot temporal action detection[C]// Proceedings of the 25th ACM International Conference on Multimedia. New York: ACM, 2017: 988-996. DOI: 10.1145/3123266.3123343
16	WANG L, XIONG Y, LIN D, et al. UntrimmedNets for weakly supervised action recognition and detection[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6402-6411. DOI: 10.1109/cvpr.2017.678
17	NARAYAN S, CHOLAKKAL H, KHAN F S, et al. 3C-Net: category count and center loss for weakly-supervised action localization[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 8679-8687. DOI: 10.1109/iccv.2019.00877
18	MIN K, CORSO J J. Adversarial background-aware loss for weakly-supervised temporal activity localization[C]// Proceedings of the 2020 European Conference on Computer Vision. Cham: Springer, 2020: 283-299. DOI: 10.1007/978-3-030-58568-6_17
19	YUAN Y, LYU Y, SHEN X, et al. Marginalized average attentional network for weakly-supervised learning[EB/OL]. [2023-03-09].
20	LI X, LIU X P, LI W C, et al. Survey on contrastive learning research[J]. Journal of Chinese Computer Systems, 2023, 44(4): 787-797.
21	HE K, FAN H, WU Y, et al. Momentum contrast for unsupervised visual representation learning[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 9726-9735. DOI: 10.1109/cvpr42600.2020.00975
22	CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations[EB/OL]. [2023-03-09].
23	GUTMANN M, HYVÄRINEN A. Noise-contrastive estimation: a new estimation principle for unnormalized statistical models[C]// Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. New York: JMLR.org, 2010: 297-304.
24	ZACH C, POCK T, BISCHOF H, et al. A duality based approach for realtime TV-L¹ optical flow[C]// Proceedings of the 29th DAGM Conference on Pattern Recognition. Berlin: Springer, 2007: 214-223. DOI: 10.1007/978-3-540-74936-3_22
25	KAY W, CARREIRA J, SIMONYAN K, et al. The Kinetics human action video dataset[EB/OL]. [2023-03-09].
26	CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 4724-4733. DOI: 10.1109/cvpr.2017.502
27	IDREES H, ZAMIR A R, JIANG Y-G, et al. The THUMOS challenge on action recognition for videos "in the wild"[J]. Computer Vision and Image Understanding, 2017, 155: 1-23. DOI: 10.1016/j.cviu.2016.10.018
28	HEILBRON F C, ESCORCIA V, GHANEM B, et al. ActivityNet: a large-scale video benchmark for human activity understanding[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 961-970. DOI: 10.1109/cvpr.2015.7298698
29	PAUL S, ROY S, ROY-CHOWDHURY A K. W-TALC: weakly-supervised temporal activity localization and classification[C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 588-607. DOI: 10.1007/978-3-030-01225-0_35
30	LEE P, UH Y, BYUN H. Background suppression network for weakly-supervised temporal action localization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11320-11327. DOI: 10.1609/aaai.v34i07.6793
31	ZHAI Y, WANG L, TANG W, et al. Two-stream consensus network for weakly-supervised temporal action localization[C]// Proceedings of the 2020 European Conference on Computer Vision. Cham: Springer, 2020: 37-54. DOI: 10.1007/978-3-030-58539-6_3
32	YANG W, ZHANG T, MAO Z, et al. Multi-scale structure-aware network for weakly supervised temporal action detection[J]. IEEE Transactions on Image Processing, 2021, 30: 5848-5861. DOI: 10.1109/tip.2021.3089361
33	ISLAM A, LONG C, RADKE R. A hybrid attention mechanism for weakly-supervised temporal action localization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(2): 1637-1645. DOI: 10.1609/aaai.v35i2.16256
34	CHENG Y, SUN Y, FAN H, et al. Entropy guided attention network for weakly-supervised action localization[J]. Pattern Recognition, 2022, 129: 108718. DOI: 10.1016/j.patcog.2022.108718
35	LIU Z, WANG L, ZHANG Q, et al. ACSNet: action-context separation network for weakly supervised temporal action localization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(3): 2233-2241. DOI: 10.1609/aaai.v35i3.16322
36	SHI H, ZHANG X-Y, LI C, et al. Dynamic graph modeling for weakly-supervised temporal action localization[C]// Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 3820-3828. DOI: 10.1145/3503161.3548077
37	KINGMA D P, BA J. Adam: A method for stochastic optimization[EB/OL]. [2023-03-09].
38	YU T, REN Z, LI Y, et al. Temporal structure mining for weakly supervised action detection[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 5521-5530. DOI: 10.1109/iccv.2019.00562
39	LEE P, WANG J, LU Y, et al. Background modeling via uncertainty estimation for weakly-supervised action localization[EB/OL]. [2023-03-09].
