Multi-scale spatio-temporal decoupling for contrastive learning of skeleton action recognition

Journal of Computer Applications, 2026, Vol. 46, Issue (3): 767-774. DOI: 10.11772/j.issn.1001-9081.2025030310

Xiaoxia LIU^1,2,3, Liqun KUANG^1,2,3, Song WANG^1,2,3, Shichao JIAO^1,2,3, Huiyan HAN^1,2,3, Fengguang XIONG^1,2,3

^1. School of Computer Science and Technology, North University of China, Taiyuan Shanxi 030051, China
^2. Shanxi Key Laboratory of Machine Vision and Virtual Reality (North University of China), Taiyuan Shanxi 030051, China
^3. Shanxi Vision Information Processing and Intelligent Robot Engineering Research Center, Taiyuan Shanxi 030051, China

Received: 2025-03-27 Revised: 2025-04-27 Accepted: 2025-04-28 Online: 2025-05-09 Published: 2026-03-10
Contact: Liqun KUANG
About author: LIU Xiaoxia, born in 2000 in Linfen, Shanxi, M. S. candidate, CCF member. Her research interests include human behavior recognition.
WANG Song, born in 1998 in Zhoukou, Henan, Ph. D. candidate, CCF member. His research interests include image fusion, multimodal data fusion.
JIAO Shichao, born in 1994 in Linfen, Shanxi, Ph. D., lecturer, CCF member. His research interests include artificial intelligence, computer vision.
HAN Huiyan, born in 1980 in Linfen, Shanxi, Ph. D., associate professor, CCF member. Her research interests include artificial intelligence, computer vision.
XIONG Fengguang, born in 1979 in Ezhou, Hubei, Ph. D., associate professor, CCF member. His research interests include artificial intelligence, computer vision.
Supported by:
Shanxi Province Science and Technology Special Plan “Taking on Challenging Projects by Responding to Calls for Solutions” Project (202201150401021); Shanxi Province Science and Technology Achievement Transformation Guidance Project (202104021301055); Shanxi Province Basic Research Program (202303021211153, 202303021212189); Shanxi Province Graduate Research Innovation Project (2024KY614)

Abstract:

To address dynamic action modeling and multi-scale temporal fusion in skeleton action recognition, an efficient Multi-scale Spatio-Temporal Decoupled Contrastive Learning Framework (MSTDCLF) was proposed. Firstly, a Multi-scale Spatio-Temporal Feature enhancement module (MSTF) was designed, which combines depthwise separable convolution with dilated convolution to model short-term motion features and long-term periodic behavior patterns simultaneously. Secondly, a channel-spatial joint attention mechanism was embedded to further strengthen the semantic response between joints and feature channels. Thirdly, a residual network with an attention mechanism was used to address the gradient decay problem of deep network structures. Finally, Bidirectional Gated Spatio-temporal Context Modeling (BGSCM) was proposed: a spatio-temporal enhancement branch was built on a Bidirectional Long Short-Term Memory (BiLSTM) network, and a gating mechanism transmitted the decoupled features bidirectionally along the joint topology and the temporal axis, thereby suppressing noise interference and establishing complete action evolution dependencies. Experimental results show that MSTDCLF achieves accuracies of 87.5% (Cross-Subject (CS)) and 93.0% (Cross-View (CV)) on the NTU RGB+D 60 dataset, and 79.3% (CS) and 80.6% (Cross-Setup (SS)) on the NTU RGB+D 120 dataset, all higher than those of the second-best method SCD-Net (Spatiotemporal Clues Disentanglement Network). Ablation experiments verify the effectiveness of the multi-scale design and the bidirectional gating mechanism, indicating that MSTDCLF learns efficient behavior representations for skeleton action recognition and effectively improves recognition accuracy.
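The multi-scale idea in the abstract can be illustrated with a minimal sketch, which is not the authors' MSTF implementation. It assumes a single feature channel (so the pointwise channel-mixing step of a depthwise separable convolution is omitted), an odd kernel size, and zero padding. The function names `receptive_field`, `dilated_conv1d` and `multi_scale_branches` are hypothetical. It shows how the same 3-tap temporal kernel covers 3, 5 or 9 frames as the dilation rate grows, which is how dilated branches can see both short and long temporal contexts.

```python
def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of frames spanned by one dilated temporal kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, weights, dilation=1):
    """Zero-padded, same-length 1D cross-correlation along the time axis.

    Assumes an odd kernel size so the kernel can be centred on each frame.
    """
    half = (len(weights) - 1) // 2 * dilation
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - half + i * dilation
            if 0 <= idx < len(signal):  # frames outside the clip count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def multi_scale_branches(signal, weights, dilations=(1, 2, 4)):
    """Apply one per-channel kernel at several dilation rates (illustrative choice)."""
    return {d: dilated_conv1d(signal, weights, d) for d in dilations}


# One joint coordinate over 9 frames with a single impulse at frame 4.
impulse = [0.0] * 9
impulse[4] = 1.0

branches = multi_scale_branches(impulse, [1.0, 1.0, 1.0])
fields = {d: receptive_field(3, d) for d in (1, 2, 4)}

for d in (1, 2, 4):
    print(d, fields[d], branches[d])
```

With dilation 1 the impulse influences only the neighbouring frames, while with dilation 4 it reaches the first and last frame of the clip. The dilation rates used here are assumptions for illustration; the source does not state the rates used in MSTF.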

Key words: action recognition, human skeleton, gated spatio-temporal context, spatio-temporal feature extraction, contrastive learning, Long Short-Term Memory (LSTM) network
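The gating mechanism named in the abstract and keywords can be pictured with a generic gated fusion, sketched below under stated assumptions. The source does not give the BGSCM equations, so the form `g * a + (1 - g) * b` with a sigmoid gate is a common textbook choice rather than the authors' formulation. The names `gated_fusion`, `forward_feat`, `backward_feat` and `gate_logits` are hypothetical. In a trained model the gate logits would be computed from the features by learned weights; here they are supplied by hand.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(forward_feat, backward_feat, gate_logits):
    """Blend two feature vectors element-wise: g * forward + (1 - g) * backward."""
    fused = []
    for f, b, z in zip(forward_feat, backward_feat, gate_logits):
        g = sigmoid(z)
        fused.append(g * f + (1.0 - g) * b)
    return fused


forward_feat = [1.0, 2.0, 3.0]
backward_feat = [3.0, 2.0, -5.0]  # last entry plays the role of a noisy value
gate_logits = [0.0, 0.0, 20.0]    # a large logit almost closes the gate on it

fused = gated_fusion(forward_feat, backward_feat, gate_logits)
print(fused)
```

A logit of 0 averages the two directions, while a large positive logit lets the forward feature through and suppresses the backward one, which is the sense in which a gate can damp a noisy input.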

CLC Number: TP391.41

Xiaoxia LIU, Liqun KUANG, Song WANG, Shichao JIAO, Huiyan HAN, Fengguang XIONG. Multi-scale spatio-temporal decoupling for contrastive learning of skeleton action recognition[J]. Journal of Computer Applications, 2026, 46(3): 767-774.

Figures/Tables (11)

Fig. 1 Overall architecture of MSTDCLF

Fig. 2 Structure of MSTF

Fig. 3 Structure of channel-spatial attention mechanism

Fig. 4 Structure of BGSCM

Fig. 5 Coordinates of human joint points

Fig. 6 Training process curves

Tab. 1 Comparison of accuracy of different methods on NTU 60 and NTU 120 datasets (%)

Method	Backbone	NTU 60 CS	NTU 60 CV	NTU 120 CS	NTU 120 SS
CSTCN [23]	GRU	85.8	92.0	77.5	78.5
HaLP [24]	GRU	79.7	86.8	71.1	72.2
HiCo [17]	GRU	82.6	90.8	75.9	77.3
ActCLR [6]	GCN	84.3	88.8	74.3	75.7
SkeAttnCLR [9]	GCN	82.0	86.5	77.1	80.0
HYSP [10]	GCN	79.1	85.2	64.5	67.3
HiCLR [19]	GCN	80.4	85.5	70.0	70.4
ViA [25]	GCN	78.1	85.8	69.2	66.9
UmURL [14]	Transformer	84.4	91.4	75.9	77.2
SCD-Net [16]	GCN+Transformer	86.4	91.3	76.7	79.6
MSTDCLF	GCN+BiLSTM	87.5	93.0	79.3	80.6
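The margins implied by Table 1 can be checked with a few lines of arithmetic. The sketch below copies the accuracy values from the table and, for each benchmark column, finds the strongest competing method and the gap to MSTDCLF in percentage points. The names `COLUMNS`, `TABLE_1` and `margins` are introduced here for illustration only.

```python
COLUMNS = ("NTU 60 CS", "NTU 60 CV", "NTU 120 CS", "NTU 120 SS")

# Accuracy (%) per method, in the column order above, copied from Table 1.
TABLE_1 = {
    "CSTCN": (85.8, 92.0, 77.5, 78.5),
    "HaLP": (79.7, 86.8, 71.1, 72.2),
    "HiCo": (82.6, 90.8, 75.9, 77.3),
    "ActCLR": (84.3, 88.8, 74.3, 75.7),
    "SkeAttnCLR": (82.0, 86.5, 77.1, 80.0),
    "HYSP": (79.1, 85.2, 64.5, 67.3),
    "HiCLR": (80.4, 85.5, 70.0, 70.4),
    "ViA": (78.1, 85.8, 69.2, 66.9),
    "UmURL": (84.4, 91.4, 75.9, 77.2),
    "SCD-Net": (86.4, 91.3, 76.7, 79.6),
    "MSTDCLF": (87.5, 93.0, 79.3, 80.6),
}


def margins(table, ours="MSTDCLF"):
    """For each column, return (best competing method, gap in percentage points)."""
    result = {}
    for i, col in enumerate(COLUMNS):
        rivals = {m: row[i] for m, row in table.items() if m != ours}
        best = max(rivals, key=rivals.get)
        result[col] = (best, round(table[ours][i] - rivals[best], 1))
    return result


def gain_over(table, baseline, ours="MSTDCLF"):
    """Per-column gain of `ours` over one named baseline, in percentage points."""
    return tuple(round(o - b, 1) for o, b in zip(table[ours], table[baseline]))


print(margins(TABLE_1))
print(gain_over(TABLE_1, "SCD-Net"))
```

Against SCD-Net the gains are 1.1, 1.7, 2.6 and 1.0 points. The strongest competitor differs by column, though: CSTCN is closest on NTU 60 CV and NTU 120 CS, and SkeAttnCLR is closest on NTU 120 SS.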

Tab. 2 Ablation experimental results on effectiveness of encoding blocks (%)

Encoding block	NTU 60 CS	NTU 60 CV	NTU 120 CS	NTU 120 SS
Base	86.4	91.3	76.7	79.6
Base+MSTF	86.6	92.2	78.1	80.1
Base+BGSCM	87.3	92.3	79.0	80.2
Base+MSTF+BGSCM	87.5	93.0	79.3	80.6
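The contribution of each encoding block in Table 2 is easier to read as a difference from the Base row. The following sketch copies the table values and computes those differences in percentage points. The names `ABLATION` and `deltas_vs_base` are introduced here for illustration only.

```python
# Accuracy (%) in the order NTU 60 CS, NTU 60 CV, NTU 120 CS, NTU 120 SS (Table 2).
ABLATION = {
    "Base": (86.4, 91.3, 76.7, 79.6),
    "Base+MSTF": (86.6, 92.2, 78.1, 80.1),
    "Base+BGSCM": (87.3, 92.3, 79.0, 80.2),
    "Base+MSTF+BGSCM": (87.5, 93.0, 79.3, 80.6),
}


def deltas_vs_base(table, base_name="Base"):
    """Gain of every configuration over the base row, rounded to one decimal."""
    base = table[base_name]
    return {
        name: tuple(round(v - b, 1) for v, b in zip(row, base))
        for name, row in table.items()
        if name != base_name
    }


deltas = deltas_vs_base(ABLATION)
for name, gains in deltas.items():
    print(name, gains)
```

On these numbers BGSCM alone adds more than MSTF alone in every column, and the combination gives the largest gain in every column, with NTU 120 CS improving the most (2.6 points). The Base row carries the same values as the SCD-Net row in Table 1.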

Fig. 7 Heatmap of weights of “hand waving” action

Fig. 8 Weighted skeleton diagrams of “hand waving” action

Fig. 9 Convolutional weight distribution with different kernel sizes and dilation rates

References (25)

[1]	MENG Y B, CHEN T T, YANG X. Convolutional temporal attention and multi-scale information learning for human action detection [J/OL]. Computer Engineering and Applications [2025-02-24]. (in Chinese)
[2]	REN Z, ZHANG Q, GAO X, et al. Multi-modality learning for human action recognition [J]. Multimedia Tools and Applications, 2021, 80(11): 16185-16203.
[3]	ZHAO D G, ZHI M. Spatial multiple-temporal graph convolutional neural network for human action recognition [J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(3): 719-732. (in Chinese)
[4]	DING S, KUANG L Q, CAO Y M, et al. High-precision and lightweight skeleton behavior recognition based on spatial-temporal feature fusion [J]. Computer Engineering, 2025, 51(11): 283-293. (in Chinese)
[5]	HUANG Q, CUI J W, LI C. A review of skeleton-based human action recognition [J]. Journal of Computer-Aided Design and Computer Graphics, 2024, 36(2): 173-194. (in Chinese)
[6]	LIN L, ZHANG J, LIU J. Actionlet-dependent contrastive learning for unsupervised skeleton-based action recognition [C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 2363-2372.
[7]	WU Z, SUN P, CHEN X, et al. SelfGCN: graph convolution network with self-attention for skeleton-based action recognition [J]. IEEE Transactions on Image Processing, 2024, 33: 4391-4403.
[8]	YU B, YIN H, ZHU Z. Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting [C]// Proceedings of the 27th International Joint Conference on Artificial Intelligence. California: IJCAI.org, 2018: 3634-3640.
[9]	HUA Y, WU W, ZHENG C, et al. Part aware contrastive learning for self-supervised action recognition [C]// Proceedings of the 32nd International Joint Conference on Artificial Intelligence. California: IJCAI.org, 2023: 855-863.
[10]	FRANCO L, MANDICA P, MUNJAL B, et al. HYperbolic Self-Paced learning for self-supervised skeleton-based action representations [EB/OL]. [2025-02-23].
[11]	WU Z, PAN S, LONG G, et al. Graph WaveNet for deep spatial-temporal graph modeling [C]// Proceedings of the 28th International Joint Conference on Artificial Intelligence. California: IJCAI.org, 2019: 1907-1913.
[12]	CHEN T, WANG J, SUN Y. Meta-MSGAT: meta multi-scale fused graph attention network [C]// Proceedings of the 2023 International Joint Conference on Neural Networks. Piscataway: IEEE, 2023: 1-8.
[13]	PLIZZARI C, CANNICI M, MATTEUCCI M. Skeleton-based action recognition via spatial and temporal Transformer networks [J]. Computer Vision and Image Understanding, 2021, 208/209: No.103219.
[14]	SUN S K, LIU D Z, DONG J F, et al. Unified multi-modal unsupervised representation learning for skeleton-based action understanding [C]// Proceedings of the 31st ACM International Conference on Multimedia. New York: ACM, 2023: 2973-2984.
[15]	GAO H, JIANG R, DONG Z, et al. Spatial-temporal-decoupled masked pre-training for spatiotemporal forecasting [C]// Proceedings of the 33rd International Joint Conference on Artificial Intelligence. California: IJCAI.org, 2024: 3998-4006.
[16]	WU C, WU X J, KITTLER J, et al. SCD-Net: spatiotemporal clues disentanglement network for self-supervised skeleton-based action recognition [C]// Proceedings of the 38th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2024: 5949-5957.
[17]	DONG J, SUN S, LIU Z, et al. Hierarchical contrast for unsupervised skeleton-based action representation learning [C]// Proceedings of the 37th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2023: 525-533.
[18]	LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 7082-7092.
[19]	ZHANG J, LIN L, LIU J. Hierarchical consistent contrastive learning for skeleton-based action recognition with growing augmentations [C]// Proceedings of the 37th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2023: 3427-3435.
[20]	LIU J, CHEN C, LIU M. Multi-modality co-learning for efficient skeleton-based action recognition [C]// Proceedings of the 32nd ACM International Conference on Multimedia. New York: ACM, 2024: 4909-4918.
[21]	SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 1010-1019.
[22]	LIU J, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(10): 2684-2701.
[23]	WANG M, LI X, CHEN S, et al. Learning representations by contrastive spatio-temporal clustering for skeleton-based action recognition [J]. IEEE Transactions on Multimedia, 2024, 26: 3207-3220.
[24]	SHAH A, ROY A, SHAH K, et al. HaLP: hallucinating latent positives for skeleton-based self-supervised learning of actions [C]// Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 18846-18856.
[25]	YANG D, WANG Y, DANTCHEVA A, et al. View-invariant skeleton action representation learning via motion retargeting [J]. Image and Vision Computing, 2024, 132(7): 2351-2366.

