Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (3): 758-766. DOI: 10.11772/j.issn.1001-9081.2025040509
Zuxi ZHANG, Zhancheng ZHANG, Fuyuan HU
Received: 2025-05-09
Revised: 2025-07-14
Accepted: 2025-07-17
Online: 2026-03-16
Published: 2026-03-10
Contact: Zhancheng ZHANG
About author: ZHANG Zuxi, born in 1998 in Jiujiang, Jiangxi, is an M. S. candidate and CCF member. His research interests include computer vision and video action recognition.
Abstract:
Owing to the diversity and complexity of spatio-temporal features in videos and the wide range of action speeds and scales, existing methods for action recognition generally capture local motion details insufficiently and exploit long-range temporal dependencies inadequately. To address this, a video action recognition network with complementary modeling of local and long-range temporal information was proposed. The network contains a Two-level Fusion Motion Excitation (TFME) module and a Temporal Aggregation Channel Excitation (TACE) module. The TFME module computes and fuses the first-order and second-order differences of adjacent frame feature maps, and uses the fused weights to excite the channels of the original feature maps, enhancing fine-grained extraction of multi-level motion features and thereby modeling local temporal information. The TACE module builds a pyramid structure with hierarchical residual connections through a channel-grouping strategy, enlarging the temporal receptive field and strengthening multi-scale feature learning; in addition, a Temporal Channel Attention (TCA) mechanism was designed to dynamically adjust the aggregated feature maps and optimize the weight allocation among temporal channels, thereby modeling long-range temporal information. Finally, these complementary modules were embedded into a 2D residual network for end-to-end action recognition. Experimental results show that on the Something-Something V1 and V2 validation sets, with only RGB video frames as input, the proposed network achieves Top-1 accuracies of 50.6% and 61.9% respectively with an 8-frame sampling strategy, and 54.1% and 65.6% respectively with a 16-frame sampling strategy. The proposed network can thus efficiently model both the local motion details and the long-range temporal dependencies of videos, offering a new approach to action recognition in complex temporal scenarios.
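The TFME computation summarized above (first- and second-order differences of adjacent frame features, fused into per-channel excitation weights) can be illustrated with a minimal NumPy sketch. The additive fusion, spatial mean pooling, and sigmoid gate below are simplifying assumptions for illustration; the actual module learns these operators with convolutions.

```python
import numpy as np

def tfme_excite(feats: np.ndarray) -> np.ndarray:
    """Illustrative sketch of two-level fused motion excitation (TFME).

    feats: (T, C, H, W) per-frame feature maps.
    """
    # first-order temporal difference between adjacent frames
    # (zero-padded at the last frame)
    d1 = np.zeros_like(feats)
    d1[:-1] = feats[1:] - feats[:-1]
    # second-order difference (difference of the differences)
    d2 = np.zeros_like(feats)
    d2[:-1] = d1[1:] - d1[:-1]
    # fuse the two motion levels (simple sum, an assumption)
    fused = d1 + d2
    # global spatial pooling -> per-frame, per-channel motion descriptor
    desc = fused.mean(axis=(2, 3))            # shape (T, C)
    # sigmoid gate, then excite the channels of the original features
    gate = 1.0 / (1.0 + np.exp(-desc))        # values in (0, 1)
    return feats * gate[:, :, None, None]
```

With static input (no motion), the descriptor is zero and the gate settles at 0.5, so the excitation responds only to actual frame-to-frame change.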
Zuxi ZHANG, Zhancheng ZHANG, Fuyuan HU. Local and long-range temporal complementary modeling for video action recognition[J]. Journal of Computer Applications, 2026, 46(3): 758-766.
| Method | Backbone | Pre-training | Frames | GFLOPs | V1 Top-1/% | V2 Top-1/% |
|---|---|---|---|---|---|---|
| NL-I3D | 3DRes50 | ImgNet+K400 | 32×3×2 | 153×3×2 | 41.6 | — |
| NL-I3D+GCN | 3DRes50+GCN | ImgNet+K400 | 32×3×2 | 303×3×2 | 46.1 | — |
| ECOEn-RGB | BNIncep+3DRes18 | K400 | 92×1×1 | 267×1×1 | 46.4 | — |
| TSN-RGB | ResNet-50 | ImageNet | 8×1×1 | 16×1×1 | 19.5 | 30.0 |
| TSN-RGB | ResNet-50 | ImageNet | 16×1×1 | 33×1×1 | 19.7 | — |
| TSM-RGB | ResNet-50 | ImageNet | 8×1×1 | 33×1×1 | 43.4 | 55.6 |
| TSM-RGB | ResNet-50 | ImageNet | 16×1×1 | 65×1×1 | 44.8 | — |
| bLVNet-TAM | bLResNet-50 | ImageNet | 8×1×1 | 24×1×1 | 46.4 | 59.1 |
| bLVNet-TAM | bLResNet-50 | SSV1/SSV2 | 16×1×1 | 48×1×1 | 48.4 | 61.7 |
| STM | ResNet-50 | ImageNet | 8×3×10 | 33×3×10 | 49.2 | 62.3 |
| STM | ResNet-50 | ImageNet | 16×3×10 | 67×3×10 | 50.7 | 64.2 |
| TEA | ResNet-50 | ImageNet | 8×1×1 | 35×1×1 | 48.9 | — |
| TEA | ResNet-50 | ImageNet | 16×1×1 | 70×1×1 | 51.9 | — |
| TANet | ResNet-50 | ImageNet | 8×1×1 | 33×1×1 | 47.3 | 60.5 |
| TANet | ResNet-50 | ImageNet | 16×1×1 | 66×1×1 | 47.6 | 62.5 |
| TDN | ResNet-50 | ImageNet | 8×3×10 | 36×3×10 | 52.3 | 64.0 |
| TDN | ResNet-50 | ImageNet | 16×3×10 | 72×3×10 | 53.9 | 65.3 |
| Uni-AdaFocus-TSM(96²) | MN2+R50 | ImageNet | (8+12)×1×1 | 9×1×1 | 48.9 | 62.5 |
| Uni-AdaFocus-TSM(128²) | MN2+R50 | ImageNet | (8+16)×1×1 | 19×1×1 | 51.0 | 64.2 |
| JCFG-STM | ResNet-50 | ImageNet | 8×3×10 | 35×3×10 | — | 62.0 |
| JCFG-STM | ResNet-50 | ImageNet | 16×3×10 | 70×3×10 | — | 62.2 |
| Proposed | ResNet-50 | ImageNet | 8×1×1 | 35×1×1 | 50.6 | 61.9 |
| Proposed | ResNet-50 | ImageNet | 16×1×1 | 70×1×1 | 52.7 | 63.2 |
| Proposed | ResNet-50 | ImageNet | 16×3×10 | 70×3×10 | 54.1 | 65.6 |
Tab. 1 Action recognition result comparison of different methods on Something-SomethingV1 and V2 datasets
| Method | Params/10⁶ | V1 Top-1/% |
|---|---|---|
| TEA block | 24.30 | 48.9 |
| TFME+MTA block | 24.51 | 49.6 |
Tab. 2 Performance comparison of local temporal modeling
| Method | Params/10⁶ | V1 Top-1/% |
|---|---|---|
| TEA block | 24.30 | 48.9 |
| ME+TACE block | 24.75 | 49.8 |
| TFME+TACE block | 24.96 | 50.6 |
Tab. 3 Performance comparison of long-range temporal modeling
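As a rough illustration of the channel-grouping pyramid with hierarchical residual connections used by TACE, the following NumPy sketch splits channels into groups and lets each later group reuse the previous group's output before a temporal convolution, so the temporal receptive field widens group by group. The averaging kernel, the pass-through first group, and the group count are illustrative assumptions, not the module's learned parameters, and the TCA attention step is omitted.

```python
import numpy as np

def temporal_conv(x: np.ndarray, k: int = 3) -> np.ndarray:
    """Depthwise temporal convolution along axis 0, with an averaging
    kernel standing in for the learned 1-D temporal convolution."""
    t = x.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad),) + ((0, 0),) * (x.ndim - 1), mode="edge")
    return np.stack([xp[i:i + k].mean(axis=0) for i in range(t)])

def tace_pyramid(feats: np.ndarray, groups: int = 4) -> np.ndarray:
    """Sketch of TACE's channel-grouping pyramid.

    feats: (T, C, H, W), with C divisible by `groups`.
    Group g > 0 receives group g-1's output via a hierarchical residual
    connection before its temporal conv, so deeper groups aggregate an
    ever larger temporal window (multi-scale temporal features).
    """
    splits = np.split(feats, groups, axis=1)
    outs = [splits[0]]                    # first group passes through
    for g in range(1, groups):
        x = splits[g] + outs[-1]          # hierarchical residual connection
        outs.append(temporal_conv(x))     # group g sees a wider time window
    return np.concatenate(outs, axis=1)
```

Because each group's output feeds the next, group g effectively covers a temporal window of roughly g·(k−1)+1 frames, which is the receptive-field enlargement the module relies on.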