Rhythmic modeling and motion-aware for few-shot action recognition

doi:10.11772/j.issn.1001-9081.2025111393

Journal of Computer Applications

Received: 2025-11-28; Revised: 2026-01-30; Accepted: 2026-02-09; Online: 2026-03-13; Published: 2026-03-13
Supported by: National Natural Science Foundation of China; Suzhou Key Core Technology “Jiebang Guashuai” (Open Competition) Project

Wenhao TAO¹, Zhancheng ZHANG², Hongzhen ZHANG³, Haotian WANG¹, Fuyuan HU¹

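1. Suzhou University of Science and Technology
2. School of Electronic and Information Engineering, Suzhou University of Science and Technology
3. Shenzhen University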

Corresponding author: Zhancheng ZHANG


Abstract: Few-shot action recognition aims to identify novel action categories from a severely limited number of labeled samples. Existing approaches commonly rely on single-frame spatial cues or coarse temporal features, which hinders the modeling of subtle inter-frame displacements and rhythmic variations and thus limits the representation of fine-grained temporal structures. To address these shortcomings, a motion-aware contrastive learning module (MACL) was introduced to guide Transformer attention toward motion-dominant regions, enhancing the dynamic consistency and discriminative power of feature representations. In addition, a displacement-aware motion encoder (DME) was devised, in which inter-frame displacement was explicitly modeled via correlation analysis and Gaussian smoothing, and key-frame dynamics were adaptively captured by a rhythm modeling module. The overall framework was built on a dual-branch architecture and trained end-to-end with multi-task joint optimization. With 1-shot and 5-shot accuracy as evaluation metrics, accuracies of 58.2% and 72.4% were obtained on the Something-Something V2 dataset, and 74.4% and 86.2% on the Kinetics dataset. These results indicate that the proposed method exploits displacement cues and rhythmic variations more fully, yielding richer representations of fine-grained temporal dynamics as well as better category separability and generalization in few-shot scenarios.

Key words: few-shot action recognition, rhythm modeling, contrastive learning, motion awareness, dual-branch architecture
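The abstract states only that DME models inter-frame displacement through correlation analysis and Gaussian smoothing and that a rhythm modeling module captures key dynamics; it gives no formulas. The following Python/NumPy sketch is a hedged illustration of one common way to realize the first two steps, not the authors' DME. Everything in it is an assumption: cosine correlation over a (2r+1)×(2r+1) search window, a Gaussian-weighted soft-argmax that turns the correlation volume into a sub-pixel displacement field, and a per-frame-pair mean displacement magnitude used as a crude stand-in for a rhythm signal. The function names `cosine_correlation`, `gaussian_soft_argmax` and `motion_profile` and the parameters `radius`, `sigma` and `temperature` are hypothetical.

```python
import numpy as np

# Hypothetical sketch of correlation-based displacement estimation; NOT the paper's DME.


def cosine_correlation(f_t, f_next, radius):
    """Cosine similarity between every position of f_t and a (2r+1)x(2r+1)
    neighbourhood around the same position in f_next.

    f_t, f_next: feature maps of shape (C, H, W).
    Returns corr of shape (H, W, 2r+1, 2r+1); offsets falling outside the
    frame are set to -inf so they receive zero weight in the softmax.
    """
    _, H, W = f_t.shape
    eps = 1e-8
    a = f_t / (np.linalg.norm(f_t, axis=0, keepdims=True) + eps)
    b = f_next / (np.linalg.norm(f_next, axis=0, keepdims=True) + eps)
    b_pad = np.pad(b, ((0, 0), (radius, radius), (radius, radius)))
    valid = np.pad(np.ones((H, W), dtype=bool), radius)  # padded border is False
    k = 2 * radius + 1
    corr = np.full((H, W, k, k), -np.inf)
    for i, dy in enumerate(range(-radius, radius + 1)):
        for j, dx in enumerate(range(-radius, radius + 1)):
            ys = slice(radius + dy, radius + dy + H)
            xs = slice(radius + dx, radius + dx + W)
            score = (a * b_pad[:, ys, xs]).sum(axis=0)  # (H, W) cosine scores
            corr[:, :, i, j] = np.where(valid[ys, xs], score, -np.inf)
    return corr


def gaussian_soft_argmax(corr, radius, sigma=1.0, temperature=0.05):
    """Convert a correlation volume into a (H, W, 2) displacement field (dy, dx).

    A Gaussian centred on each position's best match re-weights a temperature
    softmax over the window; the displacement is the expected offset.
    """
    H, W, k, _ = corr.shape
    offsets = np.arange(-radius, radius + 1, dtype=float)
    dy_grid, dx_grid = np.meshgrid(offsets, offsets, indexing="ij")  # (k, k)
    dy_flat, dx_flat = dy_grid.reshape(-1), dx_grid.reshape(-1)
    flat = corr.reshape(H, W, k * k)
    peak = flat.argmax(axis=-1)            # (H, W) index of the best match
    peak_dy = dy_flat[peak][..., None]     # (H, W, 1)
    peak_dx = dx_flat[peak][..., None]
    dist2 = (dy_flat - peak_dy) ** 2 + (dx_flat - peak_dx) ** 2  # (H, W, k*k)
    # Adding the log-Gaussian to the logits equals multiplying the softmax
    # weights by the Gaussian and renormalising.
    logits = flat / temperature - dist2 / (2.0 * sigma ** 2)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return np.stack([(w * dy_flat).sum(axis=-1), (w * dx_flat).sum(axis=-1)], axis=-1)


def motion_profile(frames, radius=2, **kw):
    """Hypothetical 'rhythm' proxy: mean displacement magnitude for each
    consecutive pair in a list of (C, H, W) feature maps."""
    profile = []
    for f_t, f_next in zip(frames[:-1], frames[1:]):
        disp = gaussian_soft_argmax(cosine_correlation(f_t, f_next, radius), radius, **kw)
        profile.append(float(np.linalg.norm(disp, axis=-1).mean()))
    return np.array(profile)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f0 = rng.standard_normal((64, 6, 6))
    f1 = np.roll(f0, shift=(1, 1), axis=(1, 2))  # content moves down 1, right 1
    disp = gaussian_soft_argmax(cosine_correlation(f0, f1, 2), 2)
    print(np.round(disp[2, 2], 3))  # expected to be close to (1, 1)
    print(motion_profile([f0, f1, f1]))  # large first step, near-zero second step
```

The Gaussian re-weighting suppresses distant secondary peaks, so the expected offset stays near the dominant match instead of being pulled toward spurious correlations. In a trained model the feature maps would come from the backbone, and the Python loops over offsets would normally be replaced by a batched correlation layer.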

CLC Number: TP391.4

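Cite this article: Wenhao TAO, Zhancheng ZHANG, Hongzhen ZHANG, Haotian WANG, Fuyuan HU. Rhythmic modeling and motion-aware for few-shot action recognition [J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2025111393.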
