Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (3): 758-766. DOI: 10.11772/j.issn.1001-9081.2025040509

• Artificial Intelligence •


Local and long-range temporal complementary modeling for video action recognition

Zuxi ZHANG, Zhancheng ZHANG(), Fuyuan HU   

  1. School of Electronics and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
  • Received:2025-05-09 Revised:2025-07-14 Accepted:2025-07-17 Online:2026-03-16 Published:2026-03-10
  • Contact: Zhancheng ZHANG
  • About author: ZHANG Zuxi, born in 1998, M. S. candidate. His research interests include computer vision and video action recognition.
    HU Fuyuan, born in 1978, Ph. D., professor. His research interests include image processing, pattern recognition, and information security.
  • Supported by:
    National Natural Science Foundation of China(62476189)


Abstract:

Due to the diversity and complexity of spatio-temporal features in videos, as well as the wide variability of actions across different speeds and scales, existing action recognition methods commonly suffer from insufficient capture of local motion details and inadequate mining of long-range temporal dependencies. Therefore, a video action recognition network based on complementary modeling of local and long-range temporal information was proposed. The network is composed of a Two-level Fusion Motion Excitation (TFME) module and a Temporal Aggregation Channel Excitation (TACE) module. In the TFME module, the first-order and second-order differences between feature maps of adjacent frames were computed and fused, and the fused weights were used to excite the channels of the original feature maps, so as to enhance the fine-grained extraction of multi-level motion features, thereby modeling local temporal information. In the TACE module, a pyramid structure with hierarchical residual connections was constructed using a channel grouping strategy, which expanded the temporal receptive field and enhanced the learning of multi-scale features. Meanwhile, a Temporal Channel Attention (TCA) mechanism was designed to adjust the aggregated feature maps dynamically and optimize the weight allocation among temporal channels, thereby modeling long-range temporal information. Finally, the above complementary modules were integrated and embedded into a 2D residual network to realize end-to-end action recognition. Experimental results on the Something-Something V1 and V2 validation sets show that, using only RGB frames as input, the proposed network achieves Top-1 accuracies of 50.6% and 61.9% respectively with an 8-frame sampling strategy, and 54.1% and 65.6% respectively with a 16-frame sampling strategy. These results show that the proposed network efficiently models both local motion details and long-range temporal dependencies, offering a new approach to action recognition in complex temporal scenarios.
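The TFME mechanism described in the abstract, first- and second-order temporal differences fused into channel-excitation weights, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the zero-padding at the sequence boundary, the equal fusion weight `alpha`, the sigmoid gating, and the residual connection are all assumptions; in the paper the module operates on convolutional feature maps inside a 2D residual network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tfme_excite(feats, alpha=0.5):
    """Sketch of Two-level Fusion Motion Excitation.

    feats: array of shape (T, C, H, W), feature maps of T sampled frames.
    Computes first-order differences between adjacent frames and
    second-order differences of those, fuses the two motion levels,
    pools them into a per-frame channel descriptor, and uses a sigmoid
    gate to re-excite the channels of the original features.
    """
    # First-order difference: x[t+1] - x[t], zero-padded at the last frame.
    d1 = np.zeros_like(feats)
    d1[:-1] = feats[1:] - feats[:-1]
    # Second-order difference: difference of the first-order differences.
    d2 = np.zeros_like(feats)
    d2[:-1] = d1[1:] - d1[:-1]
    # Fuse the two motion levels (equal weighting is an assumption here).
    fused = alpha * d1 + (1.0 - alpha) * d2
    # Global average pooling over space -> channel descriptor of shape (T, C).
    desc = fused.mean(axis=(2, 3))
    # Sigmoid gate broadcast over H and W; the residual term keeps
    # the original appearance information alongside the motion cue.
    gate = sigmoid(desc)[:, :, None, None]
    return feats + feats * gate

# Toy usage: 8 frames, 16 channels, 7x7 spatial resolution.
x = np.random.default_rng(0).standard_normal((8, 16, 7, 7))
y = tfme_excite(x)
```

Because both difference levels are zero for the padded last frame, its gate is sigmoid(0) = 0.5, so the last frame is simply scaled by 1.5; every earlier frame is re-weighted channel-wise by the fused motion signal.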

Key words: video action recognition, spatio-temporal feature, local motion, temporal modeling, attention mechanism

CLC number: