Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (3): 758-766. DOI: 10.11772/j.issn.1001-9081.2025040509

• Artificial Intelligence •


Local and long-range temporal complementary modeling for video action recognition

Zuxi ZHANG, Zhancheng ZHANG(), Fuyuan HU   

  1. School of Electronics and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
  • Received:2025-05-09 Revised:2025-07-14 Accepted:2025-07-17 Online:2026-03-16 Published:2026-03-10
  • Contact: Zhancheng ZHANG
  • About author: ZHANG Zuxi, born in 1998, M. S. candidate. His research interests include computer vision and video action recognition.
    HU Fuyuan, born in 1978, Ph. D., professor. His research interests include image processing, pattern recognition, and information security.
  • Supported by:
    National Natural Science Foundation of China(62476189)


Abstract:

Due to the diversity and complexity of spatio-temporal features in videos, as well as the wide variability of actions across different speeds and scales, existing action recognition methods commonly suffer from insufficient capture of local motion details and inadequate mining of long-range temporal dependencies. Therefore, a video action recognition network based on complementary modeling of local and long-range temporal information was proposed. The network is composed of a Two-level Fusion Motion Excitation (TFME) module and a Temporal Aggregation Channel Excitation (TACE) module. In the TFME module, the first-order and second-order differences between feature maps of adjacent frames were computed and fused, and the fused weights were used to excite the channels of the original feature maps, so as to enhance the fine-grained extraction of multi-level motion features, thereby modeling local temporal information. In the TACE module, a pyramid structure with hierarchical residual connections was constructed using a channel grouping strategy, which expanded the temporal receptive field and enhanced the learning of multi-scale features. Meanwhile, a Temporal Channel Attention (TCA) mechanism was designed to adjust the aggregated feature maps dynamically and optimize the weight allocation among temporal channels, thereby modeling long-range temporal information. Finally, the above complementary modules were integrated and embedded into a 2D residual network to realize end-to-end action recognition. Experimental results on the Something-Something V1 and V2 validation sets show that, using only RGB frames as input, the proposed network achieves Top-1 accuracies of 50.6% and 61.9% respectively with an 8-frame sampling strategy, and 54.1% and 65.6% respectively with a 16-frame sampling strategy. These results show that the proposed network efficiently models both local motion details and long-range temporal dependencies, offering a new approach to action recognition in complex temporal scenarios.
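The TFME mechanism described in the abstract, first- and second-order temporal differences fused into channel-excitation weights, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the zero-padding at the sequence boundary, the equal fusion weight `alpha`, the sigmoid gating, and the residual connection are all assumptions; in the paper the module operates on convolutional feature maps inside a 2D residual network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tfme_excite(feats, alpha=0.5):
    """Sketch of Two-level Fusion Motion Excitation.

    feats: array of shape (T, C, H, W), feature maps of T sampled frames.
    Computes first-order differences between adjacent frames and
    second-order differences of those, fuses the two motion levels,
    pools them into a per-frame channel descriptor, and uses a sigmoid
    gate to re-excite the channels of the original features.
    """
    # First-order difference: x[t+1] - x[t], zero-padded at the last frame.
    d1 = np.zeros_like(feats)
    d1[:-1] = feats[1:] - feats[:-1]
    # Second-order difference: difference of the first-order differences.
    d2 = np.zeros_like(feats)
    d2[:-1] = d1[1:] - d1[:-1]
    # Fuse the two motion levels (equal weighting is an assumption here).
    fused = alpha * d1 + (1.0 - alpha) * d2
    # Global average pooling over space -> channel descriptor of shape (T, C).
    desc = fused.mean(axis=(2, 3))
    # Sigmoid gate broadcast over H and W; the residual term keeps
    # the original appearance information alongside the motion cue.
    gate = sigmoid(desc)[:, :, None, None]
    return feats + feats * gate

# Toy usage: 8 frames, 16 channels, 7x7 spatial resolution.
x = np.random.default_rng(0).standard_normal((8, 16, 7, 7))
y = tfme_excite(x)
```

Because both difference levels are zero for the padded last frame, its gate is sigmoid(0) = 0.5, so the last frame is simply scaled by 1.5; every earlier frame is re-weighted channel-wise by the fused motion signal.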

Key words: video action recognition, spatio-temporal feature, local motion, temporal modeling, attention mechanism

CLC number: