Weakly supervised action localization based on temporal and global contextual feature enhancement

doi:10.11772/j.issn.1001-9081.2024040443

Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (3): 963-971.DOI: 10.11772/j.issn.1001-9081.2024040443

• Multimedia computing and computer simulation • Previous Articles Next Articles

Weakly supervised action localization based on temporal and global contextual feature enhancement

Weichao DANG, Yinghao FAN(), Gaimei GAO, Chunxia LIU

College of Computer Science and Technology，Taiyuan University of Science and Technology，Taiyuan Shanxi 030024，China

Received:2024-04-12 Revised:2024-06-25 Accepted:2024-06-28 Online:2025-03-17 Published:2025-03-10
Contact: Yinghao FAN
About author:DANG Weichao， born in 1974， Ph. D.， associate professor. His research interests include intelligent computing， software reliability.
GAO Gaimei， born in 1978， Ph. D.， associate professor. Her research interests include network security， cryptography.
LIU Chunxia， born in 1977， M. S.， associate professor. Her research interests include software engineering， database.
Supported by:
Shanxi Provincial Natural Science Foundation(202203021211194);Doctoral Research Start-up Fund Project of Taiyuan University of Science and Technology(20202063);Graduate Education Innovation Project of Taiyuan University of Science and Technology(SY2022063)

融合时序与全局上下文特征增强的弱监督动作定位

党伟超, 范英豪(), 高改梅, 刘春霞

太原科技大学计算机科学与技术学院，太原 030024

通讯作者: 范英豪
作者简介:党伟超（1974—），男，山西运城人，副教授，博士，CCF会员，主要研究方向：智能计算、软件可靠性
高改梅（1978－），女，山西吕梁人，副教授，博士，CCF会员，主要研究方向：网络安全、密码学
刘春霞（1977—），女，山西大同人，副教授，硕士，CCF会员，主要研究方向：软件工程、数据库。
基金资助:
山西省自然科学基金资助项目(202203021211194);太原科技大学博士科研启动基金资助项目(20202063);太原科技大学研究生教育创新项目(SY2022063)

Abstract

Abstract:

In view of the inaccuracy of action classification and localization caused by the independent processing of video clips as single action instances in the existing weakly supervised action localization studies， a weakly supervised action localization method that integrates temporal and global contextual feature enhancement was proposed. Firstly， the temporal feature enhancement branch was constructed to enlarge the receptive field by using dilated convolution， and the attention mechanism was introduced to capture the temporal dependency between video clips. Secondly， an EM （Expectation-Maximization） algorithm based on Gaussian Mixture Model （GMM） was designed to capture video context information. At the same time， global contextual feature enhancement was performed by using binary walk propagation. As the result， high-quality Temporal Class Activation Maps （TCAMs） were generated as pseudo labels to supervise the temporal enhancement branch online. Thirdly， the momentum update network was used to obtain a cross-video dictionary that reflects the action features between videos. Finally， cross-video contrastive learning was used to improve the accuracy of action classification. Experimental results show that the proposed method has the mean Average Precision （mAP） of 42.0% and 42.2% on THUMOS’14 and ActivityNet v1.3 datasets when the Intersection-over-Union （IoU） is 0.5， and compared with CCKEE （Cross-video Contextual Knowledge Exploration and Exploitation）， the proposed method has the mAP improved by 2.6 and 0.6 percentage points， respectively， proving the effectiveness of the proposed method.

Key words: weakly supervised action localization, Temporal Class Activation Map (TCAM), momentum update, pseudo label supervision, feature enhancement

摘要：

针对现有的弱监督动作定位研究中将视频片段视为单独动作实例独立处理带来的动作分类及定位不准确问题，提出一种融合时序与全局上下文特征增强的弱监督动作定位方法。首先，构建时序特征增强分支以利用膨胀卷积扩大感受野，并引入注意力机制捕获视频片段间的时序依赖性；其次，设计基于高斯混合模型（GMM）的期望最大化（EM）算法捕获视频的上下文信息，同时利用二分游走传播进行全局上下文特征增强，生成高质量的时序类激活图（TCAM）作为伪标签在线监督时序特征增强分支；再次，通过动量更新网络得到体现视频间动作特征的跨视频字典；最后，利用跨视频对比学习提高动作分类的准确性。实验结果表明，交并比（IoU）取0.5时，所提方法在THUMOS’14和ActivityNet v1.3数据集上分别取得了42.0%和42.2%的平均精度均值（mAP），相较于CCKEE （Cross-video Contextual Knowledge Exploration and Exploitation）方法，在mAP分别提升了2.6与0.6个百分点，验证了所提方法的有效性。

关键词: 弱监督动作定位, 时序类激活图, 动量更新, 伪标签监督, 特征增强

CLC Number:

TP391.4

Weichao DANG, Yinghao FAN, Gaimei GAO, Chunxia LIU. Weakly supervised action localization based on temporal and global contextual feature enhancement[J]. Journal of Computer Applications, 2025, 45(3): 963-971.

党伟超, 范英豪, 高改梅, 刘春霞. 融合时序与全局上下文特征增强的弱监督动作定位[J]. 《计算机应用》唯一官方网站, 2025, 45(3): 963-971.

Figures/Tables 11

Fig. 1 Overall framework of proposed model

Fig. 2 Optical flow enhancement module

Fig. 3 Modal consistency module

Fig. 4 EM attention and binary walk propagation

Fig. 5 Momentum update network

Tab. 1 Detection results of different methods on THUMOS’14 dataset

IoU	mAP/%
IoU	W-TALC	A2CL-PT	FAC-Net	CoLA	AUMN	DCC	BasNet	D2-Net	ASM-Loc	TBUPL	P-MIL	DTRP-Loc	CCKEE	本文方法
平均	34.4	37.8	42.2	40.9	41.5	44.0	35.3	51.5	45.1	43.4	47.1	47.0	46.8	48.0
0.1	55.2	61.2	67.6	66.2	66.2	69.0	58.2	65.7	71.2	70.5	71.8	71.4	72.6	73.1
0.2	49.6	56.1	62.1	59.5	61.9	63.8	52.3	60.2	65.5	64.4	67.5	66.0	67.1	67.6
0.3	40.1	48.1	52.6	51.5	54.9	55.9	44.6	52.3	57.1	55.2	58.9	58.0	59.5	59.6
0.4	31.1	39.0	44.3	41.9	44.4	45.9	36.0	43.4	46.8	44.8	49.0	49.3	49.3	50.3
0.5	22.8	30.1	33.4	32.2	33.3	35.7	27.0	36.0	36.6	33.7	40.0	41.4	39.4	42.0
0.6	—	19.2	22.5	22.0	20.5	24.3	18.6	—	25.2	22.9	27.1	28.0	26.5	28.6
0.7	7.6	10.6	12.7	13.1	9.0	13.7	10.4	—	13.4	12.2	15.1	15.0	13.4	14.9

Tab. 2 Detection results of different methods on ActivityNet v1.3 dataset

IoU	mAP/%
IoU	AUMN	FAC-Net	DGCNN	DCC	ASM-Loc	P-MIL	CCKEE	本文方法
AVG	23.5	24.0	23.9	24.3	25.1	25.5	25.3	25.5
0.50	38.5	37.6	37.2	38.8	41.0	41.8	41.6	42.2
0.75	23.5	24.2	23.8	24.2	24.9	25.4	25.1	25.3
0.95	5.2	6.0	5.8	5.7	6.2	5.2	6.5	6.5

Tab. 3 Ablation experimental results of different branches

IoU	mAP/%
IoU	基线	时序特征增强	全局上下文特征增强	动量更新网络
AVG	44.3	46.5	47.6	48.1
0.1	69.2	72.2	73.0	73.1
0.3	54.2	57.8	59.2	59.6
0.5	38.4	39.7	41.0	42.0
0.7	12.9	13.6	14.6	14.9

Tab. 4 Ablation experimental results of different dilated convolutional layers in optical flow enhancement module

膨胀卷积层数k	IoU	mAP/%	AVG（0.1：0.5）/%
0	0.1	71.6	56.8
	0.3	57.7
	0.5	40.7
1	0.1	72.2	57.4
	0.3	58.0
	0.5	41.1
2	0.1	72.9	57.6
	0.3	57.9
	0.5	41.3
3	0.1	73.1	58.5
	0.3	59.6
	0.5	42.0
4	0.1	72.7	57.2
	0.3	57.6
	0.5	40.8

Tab. 5 Ablation experimental results of Lnorm， Lguide and Lmil

实验序号	$L c l s X$	$L c l s X ¯$	$L n o r m$	$L g u i d e$	$L m i l$	AVG（0.1：0.7）/%
1	√	√	×	×	×	29.5
2	√	√	√	×	×	36.6
3	√	√	×	√	×	44.1
4	√	√	×	×	√	41.6
5	√	√	√	√	×	46.5
6	√	√	√	×	√	43.6
7	√	√	×	√	√	44.3
8	√	√	√	√	√	46.5

Tab. 5 Ablation experimental results of Lnorm， Lguide and Lmil

实验序号	$L c l s X$	$L c l s X ¯$	$L n o r m$	$L g u i d e$	$L m i l$	AVG（0.1：0.7）/%
1	√	√	×	×	×	29.5
2	√	√	√	×	×	36.6
3	√	√	×	√	×	44.1
4	√	√	×	×	√	41.6
5	√	√	√	√	×	46.5
6	√	√	√	×	√	43.6
7	√	√	×	√	√	44.3
8	√	√	√	√	√	46.5

Tab. 6 Time efficiency experimental results of different methods on THUMOS’14 dataset

方法	迭代次数	批次大小	训练时间/s	参数量/10⁶
FAC-Net	100	10	752	2.12
ASM-Loc	800	16	2 640	12.63
P-MIL	200	10	1 380	9.47
本文方法	150	10	1 439	8.55

References 41

1	SULTANI W， CHEN C， SHAH M. Real-world anomaly detection in surveillance videos［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 6479-6488.
2	HU W， XIE N， LI L， et al. A survey on visual content-based video indexing and retrieval［J］. IEEE Transactions on Systems， Man， and Cybernetics， Part C （Applications and Reviews）， 2011， 41（6）： 797-819.
3	SHOU Z， CHAN J， ZAREIAN A， et al. CDC： convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos［C］// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2017： 1417-1426.
4	LIN T， ZHAO X， SU H， et al. BSN： boundary sensitive network for temporal action proposal generation［C］// Proceedings of the 2018 European Conference on Computer Vision， LNCS 11208. Cham： Springer， 2018： 3-21.
5	ZHAO Y， XIONG Y， WANG L， et al. Temporal action detection with structured segment networks［C］// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway： IEEE， 2017： 2933-2942.
6	CHAO Y W， VIJAYANARASIMHAN S， SEYBOLD B， et al. Rethinking the Faster R-CNN architecture for temporal action localization［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018：1130-1139.
7	LONG F， YAO T， QIU Z， et al. Gaussian temporal awareness networks for action localization［C］// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2019： 344-353.
8	ZENG R， HUANG W， GAN C， et al. Graph convolutional networks for temporal action localization［C］// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2019： 7093-7102.
9	胡聪，华钢. 基于注意力机制的弱监督动作定位方法［J］. 计算机应用， 2022， 42（3）：960-967.
	HU C， HUA G. Weakly supervised action localization method based on attention mechanism［J］. Journal of Computer Applications， 2022， 42（3）： 960-967.
10	任冬伟，王旗龙，魏云超，等. 视觉弱监督学习研究进展［J］. 中国图象图形学报， 2022， 27（6）：1768-1798.
	REN D W， WANG Q L， WEI Y C， et al. Progress in weakly supervised learning for visual understanding［J］. Journal of Image and Graphics， 2022， 27（6）： 1768-1798.
11	WANG L， XIONG Y， LIN D， et al. UntrimmedNets for weakly supervised action recognition and detection［C］// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2017： 6402-6411.
12	NGUYEN P， HAN B， LIU T， et al. Weakly supervised action localization by sparse temporal pooling network［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 6752-6761.
13	PAUL S， ROY S， ROY-CHOWDHURY A K. W-TALC： weakly-supervised temporal activity localization and classification［C］// Proceedings of the 2018 European Conference on Computer Vision， LNCS 11208. Cham： Springer， 2018： 588-607.
14	CARREIRA J， ZISSERMAN A. Quo Vadis， action recognition？ A new model and the kinetics dataset［C］// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2017： 4724-4733.
15	LUO Z， GUILLORY D， SHI B， et al. Weakly-supervised action localization with expectation-maximization multi-instance learning［C］// Proceedings of the 2020 European Conference on Computer Vision， LNCS 12374. Cham： Springer， 2020：729-745.
16	PARDO A， ALWASSEL H， HEILBRON F C， et al. RefineLoc： iterative refinement for weakly-supervised action localization［C］// Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision. Piscataway： IEEE， 2021： 3318-3327.
17	YANG W， ZHANG T， YU X， et al. Uncertainty guided collaborative training for weakly supervised temporal action detection［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021：53-63.
18	ZHAI Y， WANG L， TANG W， et al. Two-stream consensus network for weakly-supervised temporal action localization［C］// Proceedings of the 2020 European Conference on Computer Vision， LNCS 12351. Cham： Springer， 2020： 37-54.
19	MIN K， CORSO J J. Adversarial background-aware loss for weakly-supervised temporal activity localization［C］// Proceedings of the 2020 European Conference on Computer Vision， LNCS 12359. Cham： Springer， 2020： 283-299.
20	HUANG L， WANG L， LI H. Foreground-action consistency network for weakly supervised temporal action localization［C］// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2021： 7982-7991.
21	CHEN X， HE K. Exploring simple Siamese representation learning［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 15745-15753.
22	CARON M， BOJANOWSKI P， JOULIN A， et al. Deep clustering for unsupervised learning of visual features［C］// Proceedings of the 2018 European Conference on Computer Vision， LNCS 11218. Cham： Springer， 2018： 139-156.
23	TAO L， WANG X， YAMASAKI T. An improved inter-intra contrastive learning framework on self-supervised video representation［J］. IEEE Transactions on Circuits and Systems for Video Technology， 2022， 32（8）： 5266-5280.
24	XU B， SHU X， ZHANG J， et al. Spatiotemporal decouple-and-squeeze contrastive learning for semisupervised skeleton-based action recognition［J］. IEEE Transactions on Neural Networks and Learning Systems， 2024， 35（8）： 11035-11048.
25	CHEN Z， LIN K Y， ZHENG W S. Consistent intra-video contrastive learning with asynchronous long-term memory bank［J］. IEEE Transactions on Circuits and Systems for Video Technology， 2023， 33（3）： 1168-1180.
26	ZHANG C， CAO M， YANG D， et al. CoLA： weakly-supervised temporal action localization with snippet contrastive learning［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 16005-16014.
27	LUO W， ZHANG T， YANG W， et al. Action unit memory network for weakly supervised temporal action localization［C］// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2021： 9964-9974.
28	LI J， YANG T， JI W， et al. Exploring denoised cross-video contrast for weakly-supervised temporal action localization［C］// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2022： 19882-19892.
29	ZACH C， POCK T， BISCHOF H. A duality based approach for realtime TV-L ¹ optical flow［C］// Proceedings of the 2007 DAGM Symposium， LNCS 4713. Berlin： Springer， 2007： 214-223.
30	KAY W， CARREIRA J， SIMONYAN K， et al. The Kinetics human action video dataset［EB/OL］. ［2024-04-09］. .
31	HE K M， FAN H Q， WU Y X， et al. Momentum contrast for unsupervised visual representation learning［C］// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2020： 9729-9738.
32	LEE P， UH Y， BYUN H. Background suppression network for weakly-supervised temporal action localization［C］// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto： AAAI Press， 2020： 11320-11327.
33	IDREES H， ZAMIR A R， JIANG Y G， et al. The THUMOS challenge on action recognition for videos “in the wild”［J］. Computer Vision and Image Understanding， 2017， 155： 1-23.
34	HEILBRON F C， ESCORCIA V， GHANEM B， et al. ActivityNet： a large-scale video benchmark for human activity understanding［C］// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2015： 961-970.
35	NARAYAN S， CHOLAKKAL H， HAYAT M， et al. D2-Net： weakly-supervised action localization via discriminative embeddings and denoised activations［C］// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2021： 13588-13597.
36	HE B， YANG X， KANG L， et al. ASM-Loc： action-aware segment modeling for weakly-supervised temporal action localization［C］// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2022： 13915-13925.
37	ZHANG S， ZHAO C. Cross-video contextual knowledge exploration and exploitation for ambiguity reduction in weakly supervised temporal action localization［J］. IEEE Transactions on Circuits and Systems for Video Technology， 2024， 34（6）： 4568-4580.
38	SHI H， ZHANG X Y， LI C， et al. Dynamic graph modeling for weakly-supervised temporal action localization［C］// Proceedings of the 30th ACM International Conference on Multimedia. New York： ACM， 2022： 3820-3828.
39	TANG Y， GE J， GUO K， et al. Towards better utilization of pseudo labels for weakly supervised temporal action localization［J］. Information Sciences， 2023， 623： 693-708.
40	REN H， YANG W， ZHANG T， et al. Proposal-based multiple instance learning for weakly-supervised temporal action localization［C］// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2023：2394-2404.
41	HUANG J， KONG M， CHEN L， et al. Temporal RPN learning for weakly-supervised temporal action localization［C］// Proceedings of the 15th Asian Conference on Machine Learning. New York： JMLR.org， 2024： 470-485.

[1]	Binhong XIE, Wanyin GAO, Wangdong LU, Yingjun ZHANG, Rui ZHANG. Dense object counting network with few-shot similarity matching feature enhancement [J]. Journal of Computer Applications, 2025, 45(2): 403-410.
[2]	Benchen YANG, Haoran LI, Haibo JIN. Multi-focus image fusion network with cascade fusion and enhanced reconstruction [J]. Journal of Computer Applications, 2025, 45(2): 594-600.
[3]	Jie WANG, Hua MENG. Image classification algorithm based on overall topological structure of point cloud [J]. Journal of Computer Applications, 2024, 44(4): 1107-1113.
[4]	Xinye LI, Yening HOU, Yinghui KONG, Zhiqi YAN. Few-shot object detection combining feature fusion and enhanced attention [J]. Journal of Computer Applications, 2024, 44(3): 745-751.
[5]	Jia CHEN, Hong ZHANG. Image text retrieval method based on feature enhancement and semantic correlation matching [J]. Journal of Computer Applications, 2024, 44(1): 16-23.
[6]	Bin LU, Jielin LIU. Semantic segmentation for 3D point clouds based on feature enhancement [J]. Journal of Computer Applications, 2023, 43(6): 1818-1825.
[7]	Xiangyue TAN, Xiao HU, Jiaxin YANG, Junjiang XIANG. Camouflaged object detection based on progressive feature enhancement aggregation [J]. Journal of Computer Applications, 2022, 42(7): 2192-2200.
[8]	Suqi ZHANG, Xinxin WANG, Shiyao SHE, Junhua GU. Knowledge graph recommendation model with multiple time scales and feature enhancement [J]. Journal of Computer Applications, 2022, 42(4): 1093-1098.
[9]	Jing WEN, Qiang LI. Object tracking algorithm based on spatio-temporal context information enhancement [J]. Journal of Computer Applications, 2021, 41(12): 3565-3570.

Weakly supervised action localization based on temporal and global contextual feature enhancement

融合时序与全局上下文特征增强的弱监督动作定位

RichHTML

PDF

Knowledge

Abstract

Cite this article

share this article

Figures/Tables 11

References 41

Related Articles 9

Recommended Articles

Metrics