Official website of 《计算机应用》 (Journal of Computer Applications)


Local and long-range temporal complementary modeling for video action recognition

张祖习,张战成*,胡伏原   

  1. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
  • Received: 2025-05-08  Revised: 2025-07-14  Accepted: 2025-07-17  Online: 2025-07-24  Published: 2025-07-24
  • Corresponding author: 张战成
  • Supported by:
    Research on Continual Learning Methods for Domain Variation in Unknown Environments



Abstract: Due to the diversity and complexity of spatio-temporal features in videos, as well as the wide variability of actions across different speeds and scales, existing action recognition methods commonly suffer from insufficient capture of local motion details and inadequate modeling of long-range temporal dependencies. To address these issues, a video action recognition network based on complementary modeling of local and long-range temporal information was proposed. The network was composed of a Secondary Fusion Motion Excitation (SFME) module and a Temporal Aggregation Channel Excitation (TACE) module. In the SFME module, the first-order and second-order differences between adjacent frame feature maps were computed and fused, and the fused weights were used to excite the channels of the original feature maps, thereby enhancing the fine-grained extraction of multi-level motion features and modeling local temporal information. In the TACE module, a pyramid structure with hierarchical residual connections was constructed through a channel grouping strategy, which expanded the temporal receptive field and strengthened the learning of multi-scale features. Meanwhile, a Temporal Channel Attention (TCA) mechanism was designed to dynamically adjust the aggregated feature maps and optimize the weight allocation among temporal channels, thereby modeling long-range temporal information. Finally, the above complementary modules were embedded into a 2D residual network to realize end-to-end action recognition. On the Something-Something V1 and V2 validation sets, using only RGB frames with a random 8-frame sampling strategy, the proposed network achieved Top-1 accuracies of 50.6% and 61.9%, respectively; with a 16-frame sampling strategy, the accuracies rose to 54.1% and 65.6%, respectively. The experimental results demonstrate that the proposed network efficiently models both local motion details and long-range temporal dependencies, offering a new perspective for action recognition in complex temporal scenarios.
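The difference-and-excite idea behind the SFME module can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, the fusion of the two difference orders by simple addition, the edge padding, and the residual-style gating are all illustrative assumptions; it only shows how first- and second-order temporal differences can be pooled into per-channel weights that excite the original feature maps.

```python
import numpy as np

def motion_excitation(x):
    """Sketch of difference-based channel excitation (illustrative, not the paper's code).

    x: feature maps of shape (T, C, H, W) for T sampled frames, T >= 3.
    Returns features of the same shape, re-weighted per (frame, channel).
    """
    T, C, H, W = x.shape
    # First-order differences between adjacent frames capture local motion.
    d1 = x[1:] - x[:-1]                       # (T-1, C, H, W)
    # Second-order differences capture the change of motion (acceleration).
    d2 = d1[1:] - d1[:-1]                     # (T-2, C, H, W)
    # Fuse the two difference orders (assumed here to be simple addition),
    # then pad by repeating the last map so every frame gets a weight.
    fused = d1[:-1] + d2                      # (T-2, C, H, W)
    fused = np.concatenate([fused, fused[-1:], fused[-1:]], axis=0)  # (T, C, H, W)
    # Global average pooling over space: one descriptor per (frame, channel).
    desc = fused.mean(axis=(2, 3))            # (T, C)
    # Sigmoid gate, applied residual-style so static channels are preserved.
    gate = 1.0 / (1.0 + np.exp(-desc))        # (T, C)
    return x * (1.0 + gate[:, :, None, None])
```

On a perfectly static clip all differences are zero, so the gate is uniformly sigmoid(0) = 0.5 and every channel is scaled equally; channels with strong motion receive larger weights.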

Key words: video action recognition, spatio-temporal features, local motion, temporal modeling, attention mechanism
