Multi-scale spatio-temporal decoupling for contrastive learning of skeleton action recognition

doi:10.11772/j.issn.1001-9081.2025030310

Journal of Computer Applications, 2026, Vol. 46, Issue (3): 767-774.

• Artificial intelligence •

Xiaoxia LIU¹,²,³, Liqun KUANG¹,²,³, Song WANG¹,²,³, Shichao JIAO¹,²,³, Huiyan HAN¹,²,³, Fengguang XIONG¹,²,³

1. School of Computer Science and Technology, North University of China, Taiyuan, Shanxi 030051, China
2. Shanxi Key Laboratory of Machine Vision and Virtual Reality (North University of China), Taiyuan, Shanxi 030051, China
3. Shanxi Vision Information Processing and Intelligent Robot Engineering Research Center, Taiyuan, Shanxi 030051, China

Received: 2025-03-27 Revised: 2025-04-27 Accepted: 2025-04-28 Online: 2025-05-09 Published: 2026-03-10
Contact: Liqun KUANG
About author: LIU Xiaoxia, born in 2000, from Linfen, Shanxi, M. S. candidate, CCF member. Her research interests include human behavior recognition.
WANG Song, born in 1998, from Zhoukou, Henan, Ph. D. candidate, CCF member. His research interests include image fusion, multimodal data fusion.
JIAO Shichao, born in 1994, from Linfen, Shanxi, Ph. D., lecturer, CCF member. His research interests include artificial intelligence, computer vision.
HAN Huiyan, born in 1980, from Linfen, Shanxi, Ph. D., associate professor, CCF member. Her research interests include artificial intelligence, computer vision.
XIONG Fengguang, born in 1979, from Ezhou, Hubei, Ph. D., associate professor, CCF member. His research interests include artificial intelligence, computer vision.
Supported by:
Shanxi Province Science and Technology Major Special Plan “Taking on Challenging Projects by Responding to Calls for Solutions” Project (202201150401021); Shanxi Province Science and Technology Achievement Transformation Guidance Project (202104021301055); Shanxi Province Basic Research Program (202303021211153, 202303021212189); Shanxi Province Graduate Research Innovation Project (2024KY614)

Abstract

To address the problems of dynamic action modeling and multi-scale temporal fusion in skeleton action recognition, an efficient Multi-scale Spatio-Temporal Decoupled Contrastive Learning Framework (MSTDCLF) was proposed. Firstly, a Multi-scale Spatio-Temporal Feature enhancement module (MSTF) was designed, which combines depthwise separable convolution with dilated convolution to model short-term motion features and long-term periodic behavior patterns simultaneously. Secondly, a channel-spatial joint attention mechanism was embedded to further strengthen the semantic response between joints and feature channels. Thirdly, a residual network with an attention mechanism was used to solve the gradient decay problem of deep network structures. Finally, Bidirectional Gated Spatio-temporal Context Modeling (BGSCM) was proposed: a spatio-temporal enhancement branch was constructed on the basis of a Bidirectional Long Short-Term Memory (BiLSTM) network, and the decoupled features were transmitted bidirectionally along the joint topology and the temporal axis through a gating mechanism, thereby suppressing noise interference and establishing complete action evolution dependencies. Experimental results show that MSTDCLF achieves accuracies of 87.5% (Cross-Subject (CS)) and 93.0% (Cross-View (CV)) on the NTU RGB+D 60 dataset, and 79.3% (CS) and 80.6% (Cross-Setup (SS)) on the NTU RGB+D 120 dataset, all higher than those of the second-best method, SCD-Net (Spatiotemporal Clues Disentanglement Network). Ablation experiments verify the effectiveness of the multi-scale design and the bidirectional gating mechanism, indicating that MSTDCLF achieves efficient behavior representation in skeleton action recognition and effectively improves recognition accuracy.

Key words: action recognition, human skeleton, gated spatio-temporal context, spatio-temporal feature extraction, contrastive learning, Long Short-Term Memory (LSTM) network
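
The abstract describes MSTF only at a high level, so the following is a minimal illustrative sketch of the general idea, not the authors' implementation. It assumes a branch structure in which a depthwise (per-channel) temporal convolution is run at several dilation rates, each followed by a pointwise (1x1) channel mix, with the branches summed and a residual connection added. The function names, the dilation rates (1, 2, 4), the kernel size and the toy input are all hypothetical choices made for illustration; the attention mechanism mentioned in the abstract is omitted.

```python
from typing import List, Sequence

Signal = List[List[float]]  # layout: [channel][frame]


def dilated_depthwise_conv(x: Signal, kernels: Signal, dilation: int) -> Signal:
    """Filter each channel along time with its own kernel (zero 'same' padding)."""
    out = []
    for channel, kernel in zip(x, kernels):
        half = (len(kernel) - 1) // 2
        row = []
        for t in range(len(channel)):
            acc = 0.0
            for k, weight in enumerate(kernel):
                src = t + (k - half) * dilation  # taps sit `dilation` frames apart
                if 0 <= src < len(channel):
                    acc += weight * channel[src]
            row.append(acc)
        out.append(row)
    return out


def pointwise_conv(x: Signal, weights: Signal) -> Signal:
    """1x1 convolution: mix channels independently at every frame."""
    frames = len(x[0])
    return [[sum(w * x[c][t] for c, w in enumerate(w_row)) for t in range(frames)]
            for w_row in weights]


def multi_scale_block(x: Signal, kernels: Signal, pointwise: Signal,
                      dilations: Sequence[int] = (1, 2, 4)) -> Signal:
    """Sum depthwise separable branches over several dilations, then add a residual.

    Assumes the pointwise weights keep the channel count unchanged so that
    the residual addition is well defined.
    """
    fused = [[0.0] * len(x[0]) for _ in x]
    for d in dilations:
        branch = pointwise_conv(dilated_depthwise_conv(x, kernels, d), pointwise)
        fused = [[f + b for f, b in zip(f_row, b_row)]
                 for f_row, b_row in zip(fused, branch)]
    return [[f + r for f, r in zip(f_row, x_row)] for f_row, x_row in zip(fused, x)]


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of frames spanned by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1


# Toy input: 2 channels, 9 frames, a single impulse in channel 0 at frame 4.
x = [[0.0] * 9 for _ in range(2)]
x[0][4] = 1.0
kernels = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # one 3-tap kernel per channel
identity = [[1.0, 0.0], [0.0, 1.0]]           # pointwise weights that keep channels apart
y = multi_scale_block(x, kernels, identity)
print(y[0])                                          # impulse spread over frames 0..8
print([receptive_field(3, d) for d in (1, 2, 4)])    # [3, 5, 9]
```

With a 3-tap kernel, the dilation-1 branch covers 3 frames while the dilation-4 branch covers 9, which is how a block of this kind can respond to short-term motion and longer-range patterns at the same time without enlarging the kernel.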

CLC Number:

TP391.41

Xiaoxia LIU, Liqun KUANG, Song WANG, Shichao JIAO, Huiyan HAN, Fengguang XIONG. Multi-scale spatio-temporal decoupling for contrastive learning of skeleton action recognition[J]. Journal of Computer Applications, 2026, 46(3): 767-774.

Figures/Tables 11

References 25

[1]	MENG Y B, CHEN T T, YANG X. Convolutional temporal attention and multi-scale information learning for human action detection [J/OL]. Computer Engineering and Applications [2025-02-24]. (in Chinese)
[2]	REN Z, ZHANG Q, GAO X, et al. Multi-modality learning for human action recognition [J]. Multimedia Tools and Applications, 2021, 80(11): 16185-16203.
[3]	ZHAO D G, ZHI M. Spatial multiple-temporal graph convolutional neural network for human action recognition [J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(3): 719-732. (in Chinese)
[4]	DING S, KUANG L Q, CAO Y M, et al. High-precision and lightweight skeleton behavior recognition based on spatial-temporal feature fusion [J]. Computer Engineering, 2025, 51(11): 283-293. (in Chinese)
[5]	HUANG Q, CUI J W, LI C. A review of skeleton-based human action recognition [J]. Journal of Computer-Aided Design and Computer Graphics, 2024, 36(2): 173-194. (in Chinese)
[6]	LIN L, ZHANG J, LIU J. Actionlet-dependent contrastive learning for unsupervised skeleton-based action recognition [C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 2363-2372.
[7]	WU Z, SUN P, CHEN X, et al. SelfGCN: graph convolution network with self-attention for skeleton-based action recognition [J]. IEEE Transactions on Image Processing, 2024, 33: 4391-4403.
[8]	YU B, YIN H, ZHU Z. Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting [C]// Proceedings of the 27th International Joint Conference on Artificial Intelligence. California: IJCAI.org, 2018: 3634-3640.
[9]	HUA Y, WU W, ZHENG C, et al. Part aware contrastive learning for self-supervised action recognition [C]// Proceedings of the 32nd International Joint Conference on Artificial Intelligence. California: IJCAI.org, 2023: 855-863.
[10]	FRANCO L, MANDICA P, MUNJAL B, et al. HYperbolic Self-Paced learning for self-supervised skeleton-based action representations [EB/OL]. [2025-02-23].
[11]	WU Z, PAN S, LONG G, et al. Graph WaveNet for deep spatial-temporal graph modeling [C]// Proceedings of the 28th International Joint Conference on Artificial Intelligence. California: IJCAI.org, 2019: 1907-1913.
[12]	CHEN T, WANG J, SUN Y. Meta-MSGAT: meta multi-scale fused graph attention network [C]// Proceedings of the 2023 International Joint Conference on Neural Networks. Piscataway: IEEE, 2023: 1-8.
[13]	PLIZZARI C, CANNICI M, MATTEUCCI M. Skeleton-based action recognition via spatial and temporal Transformer networks [J]. Computer Vision and Image Understanding, 2021, 208/209: No.103219.
[14]	SUN S K, LIU D Z, DONG J F, et al. Unified multi-modal unsupervised representation learning for skeleton-based action understanding [C]// Proceedings of the 31st ACM International Conference on Multimedia. New York: ACM, 2023: 2973-2984.
[15]	GAO H, JIANG R, DONG Z, et al. Spatial-temporal-decoupled masked pre-training for spatiotemporal forecasting [C]// Proceedings of the 33rd International Joint Conference on Artificial Intelligence. California: IJCAI.org, 2024: 3998-4006.
[16]	WU C, WU X J, KITTLER J, et al. SCD-Net: spatiotemporal clues disentanglement network for self-supervised skeleton-based action recognition [C]// Proceedings of the 38th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2024: 5949-5957.
[17]	DONG J, SUN S, LIU Z, et al. Hierarchical contrast for unsupervised skeleton-based action representation learning [C]// Proceedings of the 37th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2023: 525-533.
[18]	LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 7082-7092.
[19]	ZHANG J, LIN L, LIU J. Hierarchical consistent contrastive learning for skeleton-based action recognition with growing augmentations [C]// Proceedings of the 37th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2023: 3427-3435.
[20]	LIU J, CHEN C, LIU M. Multi-modality co-learning for efficient skeleton-based action recognition [C]// Proceedings of the 32nd ACM International Conference on Multimedia. New York: ACM, 2024: 4909-4918.
[21]	SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 1010-1019.
[22]	LIU J, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(10): 2684-2701.
[23]	WANG M, LI X, CHEN S, et al. Learning representations by contrastive spatio-temporal clustering for skeleton-based action recognition [J]. IEEE Transactions on Multimedia, 2024, 26: 3207-3220.
[24]	SHAH A, ROY A, SHAH K, et al. HaLP: hallucinating latent positives for skeleton-based self-supervised learning of actions [C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 18846-18856.
[25]	YANG D, WANG Y, DANTCHEVA A, et al. View-invariant skeleton action representation learning via motion retargeting [J]. International Journal of Computer Vision, 2024, 132(7): 2351-2366.

Method	Backbone	NTU 60 CS (%)	NTU 60 CV (%)	NTU 120 CS (%)	NTU 120 SS (%)
CSTCN [23]	GRU	85.8	92.0	77.5	78.5
HaLP [24]	GRU	79.7	86.8	71.1	72.2
HiCo [17]	GRU	82.6	90.8	75.9	77.3
ActCLR [6]	GCN	84.3	88.8	74.3	75.7
SkeAttnCLR [9]	GCN	82.0	86.5	77.1	80.0
HYSP [10]	GCN	79.1	85.2	64.5	67.3
HiCLR [19]	GCN	80.4	85.5	70.0	70.4
ViA [25]	GCN	78.1	85.8	69.2	66.9
UmURL [14]	Transformer	84.4	91.4	75.9	77.2
SCD-Net [16]	GCN+Transformer	86.4	91.3	76.7	79.6
MSTDCLF	GCN+BiLSTM	87.5	93.0	79.3	80.6
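
As a reading aid for the comparison table, the short script below recomputes the margins by which MSTDCLF leads, both over SCD-Net (the baseline named in the abstract) and over the strongest competing entry in each column. The numbers are copied from the table; the variable names are illustrative, and nothing here comes from the authors' code.

```python
# Accuracies (%) copied from the comparison table; column order is fixed below.
columns = ("NTU 60 CS", "NTU 60 CV", "NTU 120 CS", "NTU 120 SS")
competitors = {
    "CSTCN": (85.8, 92.0, 77.5, 78.5),
    "HaLP": (79.7, 86.8, 71.1, 72.2),
    "HiCo": (82.6, 90.8, 75.9, 77.3),
    "ActCLR": (84.3, 88.8, 74.3, 75.7),
    "SkeAttnCLR": (82.0, 86.5, 77.1, 80.0),
    "HYSP": (79.1, 85.2, 64.5, 67.3),
    "HiCLR": (80.4, 85.5, 70.0, 70.4),
    "ViA": (78.1, 85.8, 69.2, 66.9),
    "UmURL": (84.4, 91.4, 75.9, 77.2),
    "SCD-Net": (86.4, 91.3, 76.7, 79.6),
}
mstdclf = (87.5, 93.0, 79.3, 80.6)

# Margin over SCD-Net, in percentage points (rounded to hide float noise).
margin_vs_scd = [round(m - s, 1) for m, s in zip(mstdclf, competitors["SCD-Net"])]

# Strongest competitor per column and the margin over it.
best_other = []
for i, column in enumerate(columns):
    name = max(competitors, key=lambda method: competitors[method][i])
    best_other.append((column, name, round(mstdclf[i] - competitors[name][i], 1)))

print(margin_vs_scd)
for row in best_other:
    print(row)
```

On these figures the lead over SCD-Net ranges from 1.0 to 2.6 percentage points, while the strongest per-column competitor is not always SCD-Net: CSTCN is closest on NTU 60 CV and NTU 120 CS, and SkeAttnCLR on NTU 120 SS.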

Encoder block	NTU 60 CS (%)	NTU 60 CV (%)	NTU 120 CS (%)	NTU 120 SS (%)
Base	86.4	91.3	76.7	79.6
Base+MSTF	86.6	92.2	78.1	80.1
Base+BGSCM	87.3	92.3	79.0	80.2
Base+MSTF+BGSCM	87.5	93.0	79.3	80.6
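
The ablation rows can be read as gains over the Base configuration. The sketch below, using only the numbers in the table and illustrative variable names, computes the gain of each variant and compares the combined gain with the sum of the two individual gains.

```python
# Accuracies (%) copied from the ablation table.
# Column order: NTU 60 CS, NTU 60 CV, NTU 120 CS, NTU 120 SS.
ablation = {
    "Base": (86.4, 91.3, 76.7, 79.6),
    "Base+MSTF": (86.6, 92.2, 78.1, 80.1),
    "Base+BGSCM": (87.3, 92.3, 79.0, 80.2),
    "Base+MSTF+BGSCM": (87.5, 93.0, 79.3, 80.6),
}


def gain(variant: str) -> list:
    """Gain over Base in percentage points, rounded to hide float noise."""
    return [round(v - b, 1) for v, b in zip(ablation[variant], ablation["Base"])]


gain_mstf = gain("Base+MSTF")
gain_bgscm = gain("Base+BGSCM")
gain_both = gain("Base+MSTF+BGSCM")

# Sum of the individual gains, for comparison with the combined gain.
sum_of_parts = [round(a + b, 1) for a, b in zip(gain_mstf, gain_bgscm)]

print(gain_mstf, gain_bgscm, gain_both, sum_of_parts)
```

On every column the combined gain is at most the sum of the individual gains, so in these results the two modules are complementary but not fully additive; BGSCM alone contributes the larger share on all four columns.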

Related Articles 15

[1]	Yuhang XIAO, Guanfeng LI, Yuyin CHEN, Jing QIN. Few-shot relation extraction model with graph-based multi-view contrastive learning [J]. Journal of Computer Applications, 2026, 46(3): 732-740.
[2]	Zuxi ZHANG, Zhancheng ZHANG, Fuyuan HU. Local and long-range temporal complementary modeling for video action recognition [J]. Journal of Computer Applications, 2026, 46(3): 758-766.
[3]	Limei DONG, Yanzi LI, Jiayin LI, Li XU. Neighborhood-enhanced unsupervised graph anomaly detection [J]. Journal of Computer Applications, 2026, 46(2): 458-466.
[4]	Hu LUO, Mingshu ZHANG. Rumor detection method based on cross-modal attention mechanism and contrastive learning [J]. Journal of Computer Applications, 2026, 46(2): 361-367.
[5]	Tinghu WEI, Haoyan LIU, Jianning WU. Graph spatiotemporal learning model-based method for detecting dynamic changes of leg length discrepancy [J]. Journal of Computer Applications, 2026, 46(2): 587-595.
[6]	Ziyang CHENG, Ruizhang HUANG, Jingjing XUE. Deep evolutionary topic clustering model [J]. Journal of Computer Applications, 2026, 46(1): 85-94.
[7]	Zhihui ZAN, Yajing WANG, Ke LI, Zhixiang YANG, Guangyu YANG. Multi-feature fusion speech emotion recognition method based on SAA-CNN-BiLSTM network [J]. Journal of Computer Applications, 2026, 46(1): 69-76.
[8]	Wen LI, Kairong LI, Kai YANG. Subgraph-aware contrastive learning with data augmentation [J]. Journal of Computer Applications, 2026, 46(1): 1-9.
[9]	Xingyao YANG, Zheng QI, Jiong YU, Zulian ZHANG, Shuai MA, Hongtao SHEN. Session-based recommendation model based on time-aware and space-enhanced dual channel graph neural network [J]. Journal of Computer Applications, 2026, 46(1): 104-112.
[10]	Zhixiong XU, Bo LI, Xiaoyong BIAN, Qiren HU. Adversarial sample embedded attention U-Net for 3D medical image segmentation [J]. Journal of Computer Applications, 2025, 45(9): 3011-3016.
[11]	Chao LIU, Yanhua YU. Knowledge-aware recommendation model combining denoising strategy and multi-view contrastive learning [J]. Journal of Computer Applications, 2025, 45(9): 2827-2837.
[12]	Chao SHI, Yuxin ZHOU, Qian FU, Wanyu TANG, Ling HE, Yuanyuan LI. Action recognition algorithm for ADHD patients using skeleton and 3D heatmap [J]. Journal of Computer Applications, 2025, 45(9): 3036-3044.
[13]	Yilin DENG, Fajiang YU. Pseudo random number generator based on LSTM and separable self-attention mechanism [J]. Journal of Computer Applications, 2025, 45(9): 2893-2901.
[14]	Zhiyuan WANG, Tao PENG, Jie YANG. Integrating internal and external data for out-of-distribution detection training and testing [J]. Journal of Computer Applications, 2025, 45(8): 2497-2506.
[15]	Jin XIE, Surong CHU, Yan QIANG, Juanjuan ZHAO, Hua ZHANG, Yong GAO. Dual-branch distribution consistency contrastive learning model for hard negative sample identification in chest X-rays [J]. Journal of Computer Applications, 2025, 45(7): 2369-2377.