Journal of Computer Applications, 2024, Vol. 44, Issue (2): 548-555. DOI: 10.11772/j.issn.1001-9081.2023020246
Multimedia computing and computer simulation
Weichao DANG, Lei ZHANG, Gaimei GAO, Chunxia LIU
Received: 2023-03-09
Revised: 2023-06-11
Accepted: 2023-06-15
Online: 2023-08-14
Published: 2024-02-10
Contact: Lei ZHANG
About author: DANG Weichao, born in 1974 in Yuncheng, Shanxi, Ph. D., associate professor, CCF member. His research interests include intelligent computing and software reliability.
Weichao DANG, Lei ZHANG, Gaimei GAO, Chunxia LIU. Weakly supervised action localization method with snippet contrastive learning[J]. Journal of Computer Applications, 2024, 44(2): 548-555.
URL: http://www.joca.cn/EN/10.11772/j.issn.1001-9081.2023020246
Method | mAP@0.1 | mAP@0.2 | mAP@0.3 | mAP@0.4 | mAP@0.5 | mAP@0.6 | mAP@0.7 |
---|---|---|---|---|---|---|---|
STPN | 52.0 | 44.7 | 35.5 | 25.8 | 16.9 | 9.9 | 4.3 |
W-TALC | 55.2 | 49.6 | 40.1 | 31.1 | 22.8 | - | 7.6 |
MAAN | 59.8 | 50.8 | 41.1 | 30.6 | 20.3 | 12.0 | 6.9 |
BasNet | 58.2 | 52.3 | 44.6 | 36.0 | 27.0 | 18.6 | 10.4 |
DGAM | 60.0 | 54.2 | 46.8 | 38.2 | 28.8 | 19.8 | 11.4 |
A2CL-PT | 61.2 | 56.1 | 48.1 | 39.0 | 30.1 | 19.2 | 10.6 |
TSCN | 63.4 | 57.6 | 47.8 | 37.7 | 28.7 | 19.4 | 10.2 |
MSA-Net | 65.6 | 60.7 | 52.3 | 41.6 | 29.7 | 20.6 | 10.1 |
HAM-Net | 65.4 | 59.0 | 50.3 | 41.1 | 31.0 | 20.7 | 11.1 |
EGA-Net | 64.5 | 58.4 | 50.0 | 41.4 | 31.5 | 21.0 | 10.7 |
ACS-Net | - | - | 51.4 | 42.7 | 32.4 | 22.0 | 11.7 |
DGCNN | 66.3 | 59.9 | 52.3 | 43.2 | 32.8 | 22.1 | 13.1 |
Proposed model | 67.7 | 62.3 | 53.5 | 43.3 | 33.9 | 22.1 | 11.1 |
Tab. 1 Detection results of different weakly-supervised action localization methods on THUMOS14 dataset
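The mAP@IoU columns count a predicted action segment as correct only when its temporal overlap with a ground-truth segment reaches the stated threshold. The following is a minimal sketch of that standard temporal IoU test, not the authors' evaluation code; the function names `temporal_iou` and `is_true_positive` and the (start, end) tuple layout are assumptions made for illustration.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments on the same time axis."""
    # Overlap length, clamped at zero when the segments are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union length is the sum of both lengths minus the overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold):
    """Hypothetical helper: does the prediction match at this IoU threshold?"""
    return temporal_iou(pred, gt) >= threshold
```

A prediction of (0, 10) against a ground truth of (2, 10) has a temporal IoU of 0.8, so it would count at thresholds 0.1 through 0.7 but not at a threshold of 0.9. Full mAP evaluation additionally ranks predictions by confidence and averages precision per class, which this sketch omits.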
Model | mAP@0.5 | mAP@0.75 | mAP@0.95 |
---|---|---|---|
STPN | 29.3 | 16.9 | 2.6 |
MAAN | 33.7 | 21.9 | 5.5 |
TSM | 30.3 | 19.0 | 4.5 |
BasNet | 34.5 | 22.5 | 4.9 |
EGA-Net | 35.4 | 22.5 | 4.5 |
A2CL-PT | 36.8 | 22.5 | 5.2 |
BMUE | 37.0 | 23.9 | 5.7 |
DGCNN | 37.2 | 23.8 | 5.8 |
Proposed model | 40.1 | 24.0 | 6.0 |
Tab. 2 Detection results of different models on ActivityNet1.3 dataset
Balance factor | mAP@0.5 | Balance factor | mAP@0.5 |
---|---|---|---|
0.005 | 32.5 | 0.050 | 33.8 |
0.007 | 33.1 | 0.070 | 33.8 |
0.010 | 33.9 | 0.100 | 33.6 |
0.030 | 33.6 | | |
Tab. 3 Performance comparison of different balance factors on THUMOS14 dataset
Experiment | | | | | | mAP@0.1/% | mAP@0.3/% | mAP@0.5/% | mAP@0.7/% |
---|---|---|---|---|---|---|---|---|---|
1 | √ | × | × | × | × | 49.9 | 32.9 | 16.6 | 5.3 |
2 | √ | × | √ | × | × | 55.9 | 41.9 | 23.0 | 7.1 |
3 | √ | √ | × | × | × | 67.4 | 50.8 | 31.5 | 10.8 |
4 | √ | √ | √ | × | × | 65.6 | 49.4 | 29.6 | 10.0 |
5 | √ | √ | √ | √ | √ | 67.7 | 53.5 | 33.9 | 11.1 |
Tab. 4 Ablation experiment results of action context branch
Experiment | | | | mAP@0.1/% | mAP@0.3/% | mAP@0.5/% | mAP@0.7/% |
---|---|---|---|---|---|---|---|
1 | √ | × | × | 65.6 | 49.4 | 29.6 | 10.0 |
2 | √ | × | √ | 65.5 | 49.7 | 29.8 | 10.1 |
3 | √ | √ | × | 66.4 | 51.5 | 32.2 | 11.0 |
4 | √ | √ | √ | 67.7 | 53.5 | 33.9 | 11.1 |
Tab. 5 Ablation experiment results of attention guided loss and snippet contrast loss
[1] SUN C, SHETTY S, SUKTHANKAR R, et al. Temporal localization of fine-grained actions in videos by domain transfer from web images[C]// Proceedings of the 23rd ACM International Conference on Multimedia. New York: ACM, 2015: 371-380. DOI: 10.1145/2733373.2806226
[2] HU C, HUA G. Weakly supervised action localization method based on attention mechanism[J]. Journal of Computer Applications, 2022, 42(3): 960-967.
[3] GUO W B, YANG X M, JIANG Z Y, et al. Multi-temporal scales consensus for weakly supervised temporal action localization[J]. Computer Engineering and Applications, 2023, 59(10): 151-161. DOI: 10.3778/j.issn.1002-8331.2201-0233
[4] NGUYEN P, HAN B, LIU T, et al. Weakly supervised action localization by sparse temporal pooling network[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6752-6761. DOI: 10.1109/cvpr.2018.00706
[5] ZENG R, GAN C, CHEN P, et al. Breaking Winner-Takes-All: Iterative-Winners-Out networks for weakly supervised temporal action localization[J]. IEEE Transactions on Image Processing, 2019, 28(12): 5797-5808. DOI: 10.1109/tip.2019.2922108
[6] SHOU Z, GAO H, ZHANG L, et al. AutoLoc: weakly-supervised temporal action localization in untrimmed videos[C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 162-179. DOI: 10.1007/978-3-030-01270-0_10
[7] CHEN M, FANG Y, WANG X, et al. Diversity transfer network for Few-Shot learning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34: 10559-10566. DOI: 10.1609/aaai.v34i07.6628
[8] ZHUANG C, ZHAI A, YAMINS D. Local aggregation for unsupervised learning of visual embeddings[C]// Proceedings of the 2019 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2019: 6001-6011. DOI: 10.1109/iccv.2019.00610
[9] SHI B, DAI Q, MU Y, et al. Weakly-supervised action localization by generative attention modeling[C]// Proceedings of the 2020 International Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 1006-1016. DOI: 10.1109/cvpr42600.2020.00109
[10] ZHANG C, CAO M, YANG D, et al. CoLA: weakly-supervised temporal action localization with snippet contrastive learning[C]// Proceedings of the 2021 International Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 16010-16019. DOI: 10.1109/cvpr46437.2021.01575
[11] SHOU Z, WANG D, CHANG S-F. Temporal action localization in untrimmed videos via multi-stage CNNs[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 1049-1058. DOI: 10.1109/cvpr.2016.119
[12] ZHAO Y, XIONG Y, WANG L, et al. Temporal action detection with structured segment networks[C]// Proceedings of the 2017 IEEE Conference on Computer Vision. Piscataway: IEEE, 2017: 2933-2942. DOI: 10.1109/iccv.2017.317
[13] XU H, DAS A, SAENKO K. R-C3D: region convolutional 3D network for temporal activity detection[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5794-5803. DOI: 10.1109/iccv.2017.617
[14] LIN T, ZHAO X, SU H, et al. BSN: boundary sensitive network for temporal action proposal generation[C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 3-21. DOI: 10.1007/978-3-030-01225-0_1
[15] LIN T, ZHAO X, SHOU Z. Single shot temporal action detection[C]// Proceedings of the 25th ACM International Conference on Multimedia. New York: ACM, 2017: 988-996. DOI: 10.1145/3123266.3123343
[16] WANG L, XIONG Y, LIN D, et al. UntrimmedNets for weakly supervised action recognition and detection[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6402-6411. DOI: 10.1109/cvpr.2017.678
[17] NARAYAN S, CHOLAKKAL H, KHAN F S, et al. 3C-Net: category count and center loss for weakly-supervised action localization[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 8679-8687. DOI: 10.1109/iccv.2019.00877
[18] MIN K, CORSO J J. Adversarial background-aware loss for weakly-supervised temporal activity localization[C]// Proceedings of the 2020 European Conference on Computer Vision. Cham: Springer, 2020: 283-299. DOI: 10.1007/978-3-030-58568-6_17
[19] YUAN Y, LYU Y, SHEN X, et al. Marginalized average attentional network for weakly-supervised learning[EB/OL]. [2023-03-09].
[20] LI X, LIU X P, LI W C, et al. Survey on contrastive learning research[J]. Journal of Chinese Computer Systems, 2023, 44(4): 787-797.
[21] HE K, FAN H, WU Y, et al. Momentum contrast for unsupervised visual representation learning[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 9726-9735. DOI: 10.1109/cvpr42600.2020.00975
[22] CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations[EB/OL]. [2023-03-09].
[23] GUTMANN M, HYVÄRINEN A. Noise-contrastive estimation: a new estimation principle for unnormalized statistical models[C]// Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. New York: JMLR.org, 2010: 297-304.
[24] ZACH C, POCK T, BISCHOF H, et al. A duality based approach for realtime TV-L1 optical flow[C]// Proceedings of the 29th DAGM Conference on Pattern Recognition. Berlin: Springer, 2007: 214-223. DOI: 10.1007/978-3-540-74936-3_22
[25] KAY W, CARREIRA J, SIMONYAN K, et al. The Kinetics human action video dataset[EB/OL]. [2023-03-09].
[26] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 4724-4733. DOI: 10.1109/cvpr.2017.502
[27] IDREES H, ZAMIR A R, JIANG Y-G, et al. The THUMOS challenge on action recognition for videos "in the wild"[J]. Computer Vision and Image Understanding, 2017, 155: 1-23. DOI: 10.1016/j.cviu.2016.10.018
[28] HEILBRON F C, ESCORCIA V, GHANEM B, et al. ActivityNet: a large-scale video benchmark for human activity understanding[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 961-970. DOI: 10.1109/cvpr.2015.7298698
[29] PAUL S, ROY S, ROY-CHOWDHURY A K. W-TALC: weakly-supervised temporal activity localization and classification[C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 588-607. DOI: 10.1007/978-3-030-01225-0_35
[30] LEE P, UH Y, BYUN H. Background suppression network for weakly-supervised temporal action localization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11320-11327. DOI: 10.1609/aaai.v34i07.6793
[31] ZHAI Y, WANG L, TANG W, et al. Two-stream consensus network for weakly-supervised temporal action localization[C]// Proceedings of the 2020 European Conference on Computer Vision. Cham: Springer, 2020: 37-54. DOI: 10.1007/978-3-030-58539-6_3
[32] YANG W, ZHANG T, MAO Z, et al. Multi-scale structure-aware network for weakly supervised temporal action detection[J]. IEEE Transactions on Image Processing, 2021, 30: 5848-5861. DOI: 10.1109/tip.2021.3089361
[33] ISLAM A, LONG C, RADKE R. A hybrid attention mechanism for weakly-supervised temporal action localization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(2): 1637-1645. DOI: 10.1609/aaai.v35i2.16256
[34] CHENG Y, SUN Y, FAN H, et al. Entropy guided attention network for weakly-supervised action localization[J]. Pattern Recognition, 2022, 129: 108718. DOI: 10.1016/j.patcog.2022.108718
[35] LIU Z, WANG L, ZHANG Q, et al. ACSNet: action-context separation network for weakly supervised temporal action localization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(3): 2233-2241. DOI: 10.1609/aaai.v35i3.16322
[36] SHI H, ZHANG X-Y, LI C, et al. Dynamic graph modeling for weakly-supervised temporal action localization[C]// Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 3820-3828. DOI: 10.1145/3503161.3548077
[37] KINGMA D P, BA J. Adam: A method for stochastic optimization[EB/OL]. [2023-03-09].
[38] YU T, REN Z, LI Y, et al. Temporal structure mining for weakly supervised action detection[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 5521-5530. DOI: 10.1109/iccv.2019.00562
[39] LEE P, WANG J, LU Y, et al. Background modeling via uncertainty estimation for weakly-supervised action localization[EB/OL]. [2023-03-09].