Journal of Computer Applications, 2024, Vol. 44, Issue (2): 548-555. DOI: 10.11772/j.issn.1001-9081.2023020246
Special topic: Multimedia computing and computer simulation
Weichao DANG, Lei ZHANG, Gaimei GAO, Chunxia LIU
Received: 2023-03-09
Revised: 2023-06-11
Accepted: 2023-06-15
Online: 2023-08-14
Published: 2024-02-10
Contact: Lei ZHANG
About author: DANG Weichao, born in 1974 in Yuncheng, Shanxi, Ph. D., associate professor, CCF member. His research interests include intelligent computing and software reliability.
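Abstract:
Existing attention-based weakly supervised action localization methods tend to misclassify snippets at action boundaries. To address this problem, a weakly supervised action localization method with snippet contrastive learning was proposed. Firstly, a three-branch attention mechanism was introduced to measure the likelihood that each video frame is an action instance, context, or background. Secondly, the class activation sequence of each branch was constructed from the resulting attention values. Thirdly, positive and negative sample pairs were constructed by a snippet mining algorithm. Finally, snippet contrastive learning was used to guide the network to classify ambiguous snippets correctly. Experimental results show that, at an Intersection over Union (IoU) threshold of 0.5, the proposed method achieves a mean Average Precision (mAP) of 33.9% on THUMOS14 and 40.1% on ActivityNet1.3, two public datasets. Compared with the weakly supervised action localization model DGCNN (Dynamic Graph modeling for weakly-supervised temporal action localization Convolutional Neural Network), these are improvements of 1.1 and 2.9 percentage points respectively, which verifies the effectiveness of the proposed method.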
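As a rough illustration of the first, second, and last steps in the abstract, the sketch below weights a base class activation sequence by a three-way attention and scores one query snippet against a mined positive and several negatives with an InfoNCE-style loss. It is a minimal NumPy sketch, not the authors' implementation: the function names `branch_cas` and `info_nce`, the softmax normalization across the three branches, the cosine similarity, and the temperature of 0.07 are all assumptions that the abstract does not specify.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def branch_cas(cas, attention_logits):
    """Weight a base class activation sequence by three-branch attention.

    cas:              (T, C) per-snippet class scores (hypothetical base CAS).
    attention_logits: (T, 3) raw scores for action / context / background.
    Returns a (3, T, C) array, one attention-weighted CAS per branch.
    """
    # Assumption: the three attention values of a snippet sum to 1.
    attention = softmax(attention_logits, axis=1)
    return attention.T[:, :, None] * cas[None, :, :]


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE-style loss for one query snippet embedding.

    query, positive: (D,) vectors; negatives: (K, D) matrix.
    The temperature value is an assumed default, not taken from the source.
    """
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    q, p, n = normalize(query), normalize(positive), normalize(negatives)
    # Cosine similarities: the positive pair first, then every negative pair.
    logits = np.concatenate(([q @ p], n @ q)) / temperature
    return -np.log(softmax(logits)[0])
```

Under the softmax assumption, the three branch sequences sum back to the base sequence, which gives a simple sanity check. The loss is small when the query is close to its positive and far from its negatives, which is the behaviour that would pull an ambiguous boundary snippet towards the correct group.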
Weichao DANG, Lei ZHANG, Gaimei GAO, Chunxia LIU. Weakly supervised action localization method with snippet contrastive learning[J]. Journal of Computer Applications, 2024, 44(2): 548-555.
Method | mAP@0.1 | mAP@0.2 | mAP@0.3 | mAP@0.4 | mAP@0.5 | mAP@0.6 | mAP@0.7
---|---|---|---|---|---|---|---
STPN | 52.0 | 44.7 | 35.5 | 25.8 | 16.9 | 9.9 | 4.3
W-TALC | 55.2 | 49.6 | 40.1 | 31.1 | 22.8 | - | 7.6
MAAN | 59.8 | 50.8 | 41.1 | 30.6 | 20.3 | 12.0 | 6.9
BasNet | 58.2 | 52.3 | 44.6 | 36.0 | 27.0 | 18.6 | 10.4
DGAM | 60.0 | 54.2 | 46.8 | 38.2 | 28.8 | 19.8 | 11.4
A2CL-PT | 61.2 | 56.1 | 48.1 | 39.0 | 30.1 | 19.2 | 10.6
TSCN | 63.4 | 57.6 | 47.8 | 37.7 | 28.7 | 19.4 | 10.2
MSA-Net | 65.6 | 60.7 | 52.3 | 41.6 | 29.7 | 20.6 | 10.1
HAM-Net | 65.4 | 59.0 | 50.3 | 41.1 | 31.0 | 20.7 | 11.1
EGA-Net | 64.5 | 58.4 | 50.0 | 41.4 | 31.5 | 21.0 | 10.7
ACS-Net | - | - | 51.4 | 42.7 | 32.4 | 22.0 | 11.7
DGCNN | 66.3 | 59.9 | 52.3 | 43.2 | 32.8 | 22.1 | 13.1
Proposed model | 67.7 | 62.3 | 53.5 | 43.3 | 33.9 | 22.1 | 11.1
Tab. 1 Detection results of different weakly-supervised action localization methods on THUMOS14 dataset (%)
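The mAP@IoU columns depend on a temporal IoU threshold: a predicted segment is normally counted as a correct detection only if its overlap with a ground-truth segment reaches that threshold. The following is a minimal sketch of the overlap test, assuming segments are given as `(start, end)` pairs in seconds; the helper names `temporal_iou` and `is_true_positive` are hypothetical, and the full mAP computation (ranking by confidence, matching each ground truth once, averaging over classes) is omitted.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two temporal segments (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    # Guard against degenerate zero-length segments.
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """A prediction matches a ground truth when the IoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold
```

For instance, a prediction of 2 s to 6 s against a ground truth of 4 s to 8 s overlaps for 2 s out of a 6 s union, so it would pass a threshold of 0.3 but fail at 0.5. This is why scores fall as the threshold rises.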
Model | mAP@0.5 | mAP@0.75 | mAP@0.95
---|---|---|---
STPN | 29.3 | 16.9 | 2.6
MAAN | 33.7 | 21.9 | 5.5
TSM | 30.3 | 19.0 | 4.5
BasNet | 34.5 | 22.5 | 4.9
EGA-Net | 35.4 | 22.5 | 4.5
A2CL-PT | 36.8 | 22.5 | 5.2
BMUE | 37.0 | 23.9 | 5.7
DGCNN | 37.2 | 23.8 | 5.8
Proposed model | 40.1 | 24.0 | 6.0
Tab. 2 Detection results of different models on ActivityNet1.3 dataset (%)
Balance factor | mAP@0.5/%
---|---
0.005 | 32.5
0.007 | 33.1
0.010 | 33.9
0.030 | 33.6
0.050 | 33.8
0.070 | 33.8
0.100 | 33.6
Tab. 3 Performance comparison of different balance factors on THUMOS14 dataset
Experiment |  |  |  |  |  | mAP@0.1/% | mAP@0.3/% | mAP@0.5/% | mAP@0.7/%
---|---|---|---|---|---|---|---|---|---
1 | √ | × | × | × | × | 49.9 | 32.9 | 16.6 | 5.3
2 | √ | × | √ | × | × | 55.9 | 41.9 | 23.0 | 7.1
3 | √ | √ | × | × | × | 67.4 | 50.8 | 31.5 | 10.8
4 | √ | √ | √ | × | × | 65.6 | 49.4 | 29.6 | 10.0
5 | √ | √ | √ | √ | √ | 67.7 | 53.5 | 33.9 | 11.1
Tab. 4 Ablation experiment results of action context branch
Experiment |  |  |  | mAP@0.1/% | mAP@0.3/% | mAP@0.5/% | mAP@0.7/%
---|---|---|---|---|---|---|---
1 | √ | × | × | 65.6 | 49.4 | 29.6 | 10.0
2 | √ | × | √ | 65.5 | 49.7 | 29.8 | 10.1
3 | √ | √ | × | 66.4 | 51.5 | 32.2 | 11.0
4 | √ | √ | √ | 67.7 | 53.5 | 33.9 | 11.1
Tab. 5 Ablation experiment results of attention guided loss and snippet contrast loss
[1] SUN C, SHETTY S, SUKTHANKAR R, et al. Temporal localization of fine-grained actions in videos by domain transfer from web images[C]// Proceedings of the 23rd ACM International Conference on Multimedia. New York: ACM, 2015: 371-380. DOI: 10.1145/2733373.2806226
[2] HU C, HUA G. Weakly supervised action localization method based on attention mechanism[J]. Journal of Computer Applications, 2022, 42(3): 960-967. (in Chinese)
[3] GUO W B, YANG X M, JIANG Z Y, et al. Multi-temporal scales consensus for weakly supervised temporal action localization[J]. Computer Engineering and Applications, 2023, 59(10): 151-161. (in Chinese) DOI: 10.3778/j.issn.1002-8331.2201-0233
[4] NGUYEN P, HAN B, LIU T, et al. Weakly supervised action localization by sparse temporal pooling network[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6752-6761. DOI: 10.1109/cvpr.2018.00706
[5] ZENG R, GAN C, CHEN P, et al. Breaking Winner-Takes-All: Iterative-Winners-Out networks for weakly supervised temporal action localization[J]. IEEE Transactions on Image Processing, 2019, 28(12): 5797-5808. DOI: 10.1109/tip.2019.2922108
[6] SHOU Z, GAO H, ZHANG L, et al. AutoLoc: weakly-supervised temporal action localization in untrimmed videos[C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 162-179. DOI: 10.1007/978-3-030-01270-0_10
[7] CHEN M, FANG Y, WANG X, et al. Diversity transfer network for Few-Shot learning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34: 10559-10566. DOI: 10.1609/aaai.v34i07.6628
[8] ZHUANG C, ZHAI A, YAMINS D. Local aggregation for unsupervised learning of visual embeddings[C]// Proceedings of the 2019 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2019: 6001-6011. DOI: 10.1109/iccv.2019.00610
[9] SHI B, DAI Q, MU Y, et al. Weakly-supervised action localization by generative attention modeling[C]// Proceedings of the 2020 International Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 1006-1016. DOI: 10.1109/cvpr42600.2020.00109
[10] ZHANG C, CAO M, YANG D, et al. CoLA: weakly-supervised temporal action localization with snippet contrastive learning[C]// Proceedings of the 2021 International Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 16010-16019. DOI: 10.1109/cvpr46437.2021.01575
[11] SHOU Z, WANG D, CHANG S-F. Temporal action localization in untrimmed videos via multi-stage CNNs[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 1049-1058. DOI: 10.1109/cvpr.2016.119
[12] ZHAO Y, XIONG Y, WANG L, et al. Temporal action detection with structured segment networks[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 2933-2942. DOI: 10.1109/iccv.2017.317
[13] XU H, DAS A, SAENKO K. R-C3D: region convolutional 3D network for temporal activity detection[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5794-5803. DOI: 10.1109/iccv.2017.617
[14] LIN T, ZHAO X, SU H, et al. BSN: boundary sensitive network for temporal action proposal generation[C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 3-21. DOI: 10.1007/978-3-030-01225-0_1
[15] LIN T, ZHAO X, SHOU Z. Single shot temporal action detection[C]// Proceedings of the 25th ACM International Conference on Multimedia. New York: ACM, 2017: 988-996. DOI: 10.1145/3123266.3123343
[16] WANG L, XIONG Y, LIN D, et al. UntrimmedNets for weakly supervised action recognition and detection[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6402-6411. DOI: 10.1109/cvpr.2017.678
[17] NARAYAN S, CHOLAKKAL H, KHAN F S, et al. 3C-Net: category count and center loss for weakly-supervised action localization[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 8679-8687. DOI: 10.1109/iccv.2019.00877
[18] MIN K, CORSO J J. Adversarial background-aware loss for weakly-supervised temporal activity localization[C]// Proceedings of the 2020 European Conference on Computer Vision. Cham: Springer, 2020: 283-299. DOI: 10.1007/978-3-030-58568-6_17
[19] YUAN Y, LYU Y, SHEN X, et al. Marginalized average attentional network for weakly-supervised learning[EB/OL]. [2023-03-09].
[20] LI X, LIU X P, LI W C, et al. Survey on contrastive learning research[J]. Journal of Chinese Computer Systems, 2023, 44(4): 787-797. (in Chinese)
[21] HE K, FAN H, WU Y, et al. Momentum contrast for unsupervised visual representation learning[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 9726-9735. DOI: 10.1109/cvpr42600.2020.00975
[22] CHEN T, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations[EB/OL]. [2023-03-09].
[23] GUTMANN M, HYVÄRINEN A. Noise-contrastive estimation: a new estimation principle for unnormalized statistical models[C]// Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. New York: JMLR.org, 2010: 297-304.
[24] ZACH C, POCK T, BISCHOF H, et al. A duality based approach for realtime TV-L1 optical flow[C]// Proceedings of the 29th DAGM Conference on Pattern Recognition. Berlin: Springer, 2007: 214-223. DOI: 10.1007/978-3-540-74936-3_22
[25] KAY W, CARREIRA J, SIMONYAN K, et al. The Kinetics human action video dataset[EB/OL]. [2023-03-09].
[26] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the kinetics dataset[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 4724-4733. DOI: 10.1109/cvpr.2017.502
[27] IDREES H, ZAMIR A R, JIANG Y-G, et al. The THUMOS challenge on action recognition for videos "in the wild"[J]. Computer Vision and Image Understanding, 2017, 155: 1-23. DOI: 10.1016/j.cviu.2016.10.018
[28] HEILBRON F C, ESCORCIA V, GHANEM B, et al. ActivityNet: a large-scale video benchmark for human activity understanding[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 961-970. DOI: 10.1109/cvpr.2015.7298698
[29] PAUL S, ROY S, ROY-CHOWDHURY A K. W-TALC: weakly-supervised temporal activity localization and classification[C]// Proceedings of the 2018 European Conference on Computer Vision. Cham: Springer, 2018: 588-607. DOI: 10.1007/978-3-030-01225-0_35
[30] LEE P, UH Y, BYUN H. Background suppression network for weakly-supervised temporal action localization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11320-11327. DOI: 10.1609/aaai.v34i07.6793
[31] ZHAI Y, WANG L, TANG W, et al. Two-stream consensus network for weakly-supervised temporal action localization[C]// Proceedings of the 2020 European Conference on Computer Vision. Cham: Springer, 2020: 37-54. DOI: 10.1007/978-3-030-58539-6_3
[32] YANG W, ZHANG T, MAO Z, et al. Multi-scale structure-aware network for weakly supervised temporal action detection[J]. IEEE Transactions on Image Processing, 2021, 30: 5848-5861. DOI: 10.1109/tip.2021.3089361
[33] ISLAM A, LONG C, RADKE R. A hybrid attention mechanism for weakly-supervised temporal action localization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(2): 1637-1645. DOI: 10.1609/aaai.v35i2.16256
[34] CHENG Y, SUN Y, FAN H, et al. Entropy guided attention network for weakly-supervised action localization[J]. Pattern Recognition, 2022, 129: 108718. DOI: 10.1016/j.patcog.2022.108718
[35] LIU Z, WANG L, ZHANG Q, et al. ACSNet: action-context separation network for weakly supervised temporal action localization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(3): 2233-2241. DOI: 10.1609/aaai.v35i3.16322
[36] SHI H, ZHANG X-Y, LI C, et al. Dynamic graph modeling for weakly-supervised temporal action localization[C]// Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 3820-3828. DOI: 10.1145/3503161.3548077
[37] KINGMA D P, BA J. Adam: A method for stochastic optimization[EB/OL]. [2023-03-09].
[38] YU T, REN Z, LI Y, et al. Temporal structure mining for weakly supervised action detection[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 5521-5530. DOI: 10.1109/iccv.2019.00562
[39] LEE P, WANG J, LU Y, et al. Background modeling via uncertainty estimation for weakly-supervised action localization[EB/OL]. [2023-03-09].