Journal of Computer Applications


Rhythmic modeling and motion-aware for few-shot action recognition

  

  • Received: 2025-11-28; Revised: 2026-01-30; Accepted: 2026-02-09; Online: 2026-03-13; Published: 2026-03-13
  • Supported by:
    National Natural Science Foundation of China; Suzhou "Open Bidding for Selecting the Best Candidates" (揭榜挂帅) Key Core Technology Project


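Tao Wenhao (陶文浩)1, Zhang Zhancheng (张战成)2, Zhang Hongzhen (张洪祯)3, Wang Haotian (王号天)1, Hu Fuyuan (胡伏原)1

  1. Suzhou University of Science and Technology
  2. School of Electronic and Information Engineering, Suzhou University of Science and Technology
  3. Shenzhen University
  • Corresponding author: Zhang Zhancheng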

Abstract: Few-shot action recognition aims to identify novel action categories from only a few labeled samples. Existing approaches commonly rely on single-frame spatial cues or coarse-grained temporal features, which makes it hard to capture subtle inter-frame displacements and rhythmic variations and therefore leaves key temporal structures insufficiently modeled. To address these shortcomings, a motion-aware contrastive learning module (MACL) is proposed that guides Transformer attention toward motion-dominant regions, improving the dynamic consistency and discriminative power of the feature representations. In addition, a displacement-aware motion encoder (DME) is built that explicitly models inter-frame displacement through correlation analysis and Gaussian smoothing and, combined with a rhythm modeling module, adaptively captures key dynamic changes. The overall model is built on a dual-branch architecture and trained end to end with joint multi-task optimization. On the Something-Something V2 dataset, it reaches accuracies of 58.2% and 72.4% in the 1-shot and 5-shot settings, respectively; on the Kinetics dataset, the corresponding accuracies are 74.4% and 86.2%. These results indicate that the method exploits displacement cues and rhythmic variations more fully, yielding richer representations of fine-grained temporal dynamics and better category separability and generalization in few-shot scenarios.
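
The abstract describes modeling inter-frame displacement through correlation analysis and Gaussian smoothing but gives no implementation details. The following is only a minimal sketch of one common way to combine the two ideas, not the authors' DME: a local correlation volume between two frame feature maps, followed by a Gaussian-windowed soft-argmax around the correlation peak. The function names (`local_correlation`, `displacement_field`) and the parameters `radius`, `sigma`, and `temperature` are assumptions introduced purely for illustration.

```python
import numpy as np


def local_correlation(f1, f2, radius):
    """Correlate each position of f1 with a (2r+1) x (2r+1) window of f2.

    f1, f2: feature maps of shape (H, W, C).
    Returns an array of shape (H, W, 2r+1, 2r+1).
    """
    H, W, _ = f1.shape
    k = 2 * radius + 1
    # Zero-pad f2 so every shifted window stays in bounds.
    f2p = np.pad(f2, ((radius, radius), (radius, radius), (0, 0)))
    corr = np.empty((H, W, k, k), dtype=np.float64)
    for i in range(k):          # vertical offset dy = i - radius
        for j in range(k):      # horizontal offset dx = j - radius
            shifted = f2p[i:i + H, j:j + W]
            corr[:, :, i, j] = (f1 * shifted).sum(axis=-1)
    return corr


def displacement_field(f1, f2, radius=2, sigma=1.0, temperature=0.05):
    """Estimate a per-position displacement (dy, dx) from f1 to f2.

    Hypothetical illustration: a Gaussian window centred on the correlation
    peak suppresses distant spurious matches before a soft-argmax.
    Returns an array of shape (H, W, 2).
    """
    corr = local_correlation(f1, f2, radius)
    H, W, k, _ = corr.shape
    offsets = np.arange(-radius, radius + 1)
    dy, dx = np.meshgrid(offsets, offsets, indexing="ij")   # each (k, k)

    # Offset of the strongest correlation at every position.
    peak = corr.reshape(H, W, -1).argmax(axis=-1)
    py = dy.ravel()[peak][..., None, None]
    px = dx.ravel()[peak][..., None, None]

    # Gaussian smoothing window centred on the peak offset.
    gauss = np.exp(-((dy - py) ** 2 + (dx - px) ** 2) / (2.0 * sigma ** 2))

    # Numerically stable softmax weights, modulated by the Gaussian window.
    logits = corr / temperature
    logits = logits - logits.max(axis=(2, 3), keepdims=True)
    weights = gauss * np.exp(logits)
    weights /= weights.sum(axis=(2, 3), keepdims=True)

    flow_y = (weights * dy).sum(axis=(2, 3))
    flow_x = (weights * dx).sum(axis=(2, 3))
    return np.stack([flow_y, flow_x], axis=-1)
```

As a sanity check under these assumptions, if `f2` is `f1` shifted by one row and two columns (for example with `np.roll`) and the features are unit-normalized, the estimated displacement at interior positions should be close to (1, 2). In a full model, such a displacement map would presumably be passed on to further layers; that part is not specified in the abstract and is not sketched here.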

Key words: few-shot action recognition, rhythm modeling, contrastive learning, motion awareness, dual-branch architecture



CLC Number: