Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (3): 758-766. DOI: 10.11772/j.issn.1001-9081.2025040509
• Artificial intelligence •
Zuxi ZHANG, Zhancheng ZHANG, Fuyuan HU
Received: 2025-05-09
Revised: 2025-07-14
Accepted: 2025-07-17
Online: 2026-03-16
Published: 2026-03-10
Contact: Zhancheng ZHANG
About author: ZHANG Zuxi, born in 1998 in Jiujiang, Jiangxi, M. S. candidate, CCF member. His research interests include computer vision and video action recognition.
Zuxi ZHANG, Zhancheng ZHANG, Fuyuan HU. Local and long-range temporal complementary modeling for video action recognition[J]. Journal of Computer Applications, 2026, 46(3): 758-766.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2025040509
| Method | Backbone | Pre-training | Frames | GFLOPs | V1-Top1/% | V2-Top1/% |
|---|---|---|---|---|---|---|
| NL-I3D | 3DRes50 | ImgNet+K400 | 32×3×2 | 153×3×2 | 41.6 | — |
| NL-I3D+GCN | 3DRes50+GCN | ImgNet+K400 | 32×3×2 | 303×3×2 | 46.1 | — |
| ECOEn-RGB | BNIncep+3DRes18 | K400 | 92×1×1 | 267×1×1 | 46.4 | — |
| TSN-RGB | ResNet-50 | ImageNet | 8×1×1 | 16×1×1 | 19.5 | 30.0 |
| TSN-RGB | ResNet-50 | ImageNet | 16×1×1 | 33×1×1 | 19.7 | — |
| TSM-RGB | ResNet-50 | ImageNet | 8×1×1 | 33×1×1 | 43.4 | 55.6 |
| TSM-RGB | ResNet-50 | ImageNet | 16×1×1 | 65×1×1 | 44.8 | — |
| bLVNet-TAM | bLResNet-50 | ImageNet | 8×1×1 | 24×1×1 | 46.4 | 59.1 |
| bLVNet-TAM | bLResNet-50 | SSV1/SSV2 | 16×1×1 | 48×1×1 | 48.4 | 61.7 |
| STM | ResNet-50 | ImageNet | 8×3×10 | 33×3×10 | 49.2 | 62.3 |
| STM | ResNet-50 | ImageNet | 16×3×10 | 67×3×10 | 50.7 | 64.2 |
| TEA | ResNet-50 | ImageNet | 8×1×1 | 35×1×1 | 48.9 | — |
| TEA | ResNet-50 | ImageNet | 16×1×1 | 70×1×1 | 51.9 | — |
| TANet | ResNet-50 | ImageNet | 8×1×1 | 33×1×1 | 47.3 | 60.5 |
| TANet | ResNet-50 | ImageNet | 16×1×1 | 66×1×1 | 47.6 | 62.5 |
| TDN | ResNet-50 | ImageNet | 8×3×10 | 36×3×10 | 52.3 | 64.0 |
| TDN | ResNet-50 | ImageNet | 16×3×10 | 72×3×10 | 53.9 | 65.3 |
| Uni-AdaFocus-TSM(962) | MN2+R50 | ImageNet | (8+12)×1×1 | 9×1×1 | 48.9 | 62.5 |
| Uni-AdaFocus-TSM(1282) | MN2+R50 | ImageNet | (8+16)×1×1 | 19×1×1 | 51.0 | 64.2 |
| JCFG-STM | ResNet-50 | ImageNet | 8×3×10 | 35×3×10 | — | 62.0 |
| JCFG-STM | ResNet-50 | ImageNet | 16×3×10 | 70×3×10 | — | 62.2 |
| Proposed method | ResNet-50 | ImageNet | 8×1×1 | 35×1×1 | 50.6 | 61.9 |
| Proposed method | ResNet-50 | ImageNet | 16×1×1 | 70×1×1 | 52.7 | 63.2 |
| Proposed method | ResNet-50 | ImageNet | 16×3×10 | 70×3×10 | 54.1 | 65.6 |
Tab. 1 Action recognition result comparison of different methods on Something-SomethingV1 and V2 datasets
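In Tab. 1, the Frames and GFLOPs columns use the common views notation frames × spatial crops × temporal clips: the total inference cost is the per-view cost multiplied by the number of views. A minimal sketch of this bookkeeping (the helper name is illustrative, not from the paper):

```python
def total_gflops(per_view_gflops: float, crops: int, clips: int) -> float:
    """Total inference cost: per-view GFLOPs x spatial crops x temporal clips."""
    return per_view_gflops * crops * clips

# E.g., STM at 16 frames is reported as 67x3x10, i.e. 2010 GFLOPs in total,
# while TSM at 16 frames with a single view (65x1x1) stays at 65 GFLOPs.
print(total_gflops(67.0, 3, 10))
print(total_gflops(65.0, 1, 1))
```

This is why single-view protocols (…×1×1) appear far cheaper at test time even when their per-view cost is similar.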
| Method | Parameters/10⁶ | V1-Top1/% |
|---|---|---|
| TEA block | 24.30 | 48.9 |
| TFME+MTA block | 24.51 | 49.6 |
Tab. 2 Performance comparison of local temporal modeling
| Method | Parameters/10⁶ | V1-Top1/% |
|---|---|---|
| TEA block | 24.30 | 48.9 |
| ME+TACE block | 24.75 | 49.8 |
| TFME+TACE block | 24.96 | 50.6 |
Tab. 3 Performance comparison of long-range temporal modeling
| [1] | NI R Y, ZHANG Y. Action recognition method based on video spatio-temporal features [J]. Journal of Computer Applications, 2023, 43(2): 521-528. |
| [2] | ZHANG M, TIAN G, ZHANG Y, et al. Service skill improvement for home robots: autonomous generation of action sequence based on reinforcement learning [J]. Knowledge-Based Systems, 2021, 212: No.106605. |
| [3] | WANG H, SCHMID C. Action recognition with improved trajectories [C]// Proceedings of the 2013 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2013: 3551-3558. |
| [4] | MORSHED M G, SULTANA T, ALAM A, et al. Human action recognition: a taxonomy-based survey, updates, and opportunities[J]. Sensors, 2023, 23(4): No.2182. |
| [5] | TANG Z, ZHAO Y, WEN Y, et al. A survey on backbones for deep video action recognition [C]// Proceedings of the 2024 International Conference on Multimedia and Expo Workshops. Piscataway: IEEE, 2024: 1-6. |
| [6] | WANG L F, WU J S, YIN P L, et al. Action recognition algorithm based on attention mechanism and energy function [J]. Journal of Computer Applications, 2025, 45(1): 234-239. |
| [7] | WANG Z, SHE Q, SMOLIC A. ACTION-Net: multipath excitation for action recognition [C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 13209-13218. |
| [8] | JIANG Z, ZHANG Y, HU S. ESTI: an action recognition network with enhanced spatio-temporal information [J]. International Journal of Machine Learning and Cybernetics, 2023, 14(9): 3059-3070. |
| [9] | LIU T, MA Y, YANG W, et al. Spatial-temporal interaction learning based two-stream network for action recognition [J]. Information Sciences, 2022, 606: 864-876. |
| [10] | WANG Z, LU H, JIN J, et al. Human action recognition based on improved two-stream convolution network [J]. Applied Sciences, 2022, 12(12): No.5784. |
| [11] | LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 7082-7092. |
| [12] | CAI Z. A novel spatio-temporal-wise network for action recognition[J]. IEEE Access, 2023, 11: 49071-49080. |
| [13] | CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the kinetics dataset [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 4724-4733. |
| [14] | CHEN B, MENG F, TANG H, et al. Two-level attention module based on spurious-3D residual networks for human action recognition [J]. Sensors, 2023, 23(3): No.1707. |
| [15] | DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale [EB/OL]. [2025-03-12]. |
| [16] | ARNAB A, DEHGHANI M, HEIGOLD G, et al. ViViT: a video Vision Transformer [C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 6816-6826. |
| [17] | WANG L, HUANG B, ZHAO Z, et al. VideoMAE V2: scaling video masked autoencoders with dual masking [C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 14549-14560. |
| [18] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010. |
| [19] | SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos [C]// Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1. Cambridge: MIT Press, 2014: 568-576. |
| [20] | WANG L, XIONG Y, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition [C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9912. Cham: Springer, 2016: 20-36. |
| [21] | LI Y, JI B, SHI X, et al. TEA: temporal excitation and aggregation for action recognition [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 906-915. |
| [22] | WANG L, TONG Z, JI B, et al. TDN: temporal difference networks for efficient action recognition [C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 1895-1904. |
| [23] | CHEN Y, GE H, LIU Y, et al. AGPN: action granularity pyramid network for video action recognition [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(8): 3912-3923. |
| [24] | LI C, CHENG C, YU M, et al. Joint coarse to fine-grained spatio-temporal modeling for video action recognition [J]. IEEE Transactions on Biometrics, Behavior, and Identity Science, 2025, 7(3): 444-457. |
| [25] | LI K, LI X, WANG Y, et al. CT-Net: channel tensorization network for video classification [EB/OL]. [2025-04-09]. |
| [26] | WANG X, GIRSHICK R, GUPTA A, et al. Non-local neural networks [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7794-7803. |
| [27] | XIONG X, MIN W, HAN Q, et al. Action recognition using action sequences optimization and two-stream 3D dilated neural network [J]. Computational Intelligence and Neuroscience, 2022, 2022: No.6608448. |
| [28] | TAO Y, TAO H, ZHUANG Z, et al. Quantized iterative learning control of communication-constrained systems with encoding and decoding mechanism [J]. Transactions of the Institute of Measurement and Control, 2024, 46(10): 1943-1954. |
| [29] | SONG X, PENG Z, SONG S, et al. Anti-disturbance state estimation for PDT-switched RDNNs utilizing time-sampling and space-splitting measurements [J]. Communications in Nonlinear Science and Numerical Simulation, 2024, 132: No.107945. |
| [30] | HU J, SHEN L, SUN G. Squeeze-and-excitation networks [C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7132-7141. |
| [31] | WANG Q, WU B, ZHU P, et al. ECA-Net: efficient channel attention for deep convolutional neural networks [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 11531-11539. |
| [32] | LI Z, LI J, MA Y, et al. Spatio-temporal adaptive network with bidirectional temporal difference for action recognition [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(9): 5174-5185. |
| [33] | HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. |
| [34] | JIANG B, WANG M, GAN W, et al. STM: spatiotemporal and motion encoding for action recognition [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 2000-2009. |
| [35] | GAO S H, CHENG M M, ZHAO K, et al. Res2Net: a new multi-scale backbone architecture [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(2): 652-662. |
| [36] | GAO H, WANG Z, JI S. ChannelNets: compact and efficient convolutional neural networks via channel-wise convolutions [C]// Proceedings of the 32nd Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2018: 5203-5211. |
| [37] | GOYAL R, EBRAHIMI KAHOU S, MICHALSKI V, et al. The “something something” video database for learning and evaluating visual common sense [C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5843-5851. |
| [38] | KAY W, CARREIRA J, SIMONYAN K, et al. The kinetics human action video dataset [EB/OL]. [2025-04-26]. |
| [39] | WANG X, GUPTA A. Videos as space-time region graphs [C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11209. Cham: Springer, 2018: 413-431. |
| [40] | ZOLFAGHARI M, SINGH K, BROX T. ECO: efficient convolutional network for online video understanding [C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11206. Cham: Springer, 2018: 713-730. |
| [41] | FAN Q, CHEN C F R, KUEHNE H, et al. More is less: learning efficient video representations by big-little network and depthwise temporal aggregation [C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 2264-2273. |
| [42] | LIU Z, WANG L, WU W, et al. TAM: temporal adaptive module for video recognition [C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 13688-13698. |
| [43] | WANG Y, ZHANG H, YUE Y, et al. Uni-AdaFocus: spatial-temporal dynamic computation for video recognition [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025, 47(3): 1782-1799. |