Journal of Computer Applications (official website) ›› 2026, Vol. 46 ›› Issue (3): 767-774. DOI: 10.11772/j.issn.1001-9081.2025030310

• Artificial Intelligence •

Contrastive learning for skeleton-based action recognition with multi-scale spatio-temporal decoupling

刘晓霞1,2,3, 况立群1,2,3, 王松1,2,3, 焦世超1,2,3, 韩慧妍1,2,3, 熊风光1,2,3

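  1. School of Computer Science and Technology, North University of China, Taiyuan 030051
    2. Shanxi Key Laboratory of Machine Vision and Virtual Reality (North University of China), Taiyuan 030051
    3. Shanxi Vision Information Processing and Intelligent Robot Engineering Research Center, Taiyuan 030051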
  • Received: 2025-03-27 Revised: 2025-04-27 Accepted: 2025-04-28 Online: 2025-05-09 Published: 2026-03-10
  • Corresponding author: 况立群 (Liqun KUANG)
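  • About the authors: LIU Xiaoxia (刘晓霞), born in 2000, female, from Linfen, Shanxi, M. S. candidate, CCF member. Her main research interest is human behavior recognition.
    WANG Song (王松), born in 1998, male, from Zhoukou, Henan, Ph. D. candidate, CCF member. His main research interests are image fusion and multimodal data fusion.
    JIAO Shichao (焦世超), born in 1994, male, from Linfen, Shanxi, lecturer, Ph. D., CCF member. His main research interests are artificial intelligence and computer vision.
    HAN Huiyan (韩慧妍), born in 1980, female, from Linfen, Shanxi, associate professor, Ph. D., CCF member. Her main research interests are artificial intelligence and computer vision.
    XIONG Fengguang (熊风光), born in 1979, male, from Ezhou, Hubei, associate professor, Ph. D., CCF member. His main research interests are artificial intelligence and computer vision.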
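  • Funding:
    Shanxi Province Science and Technology Major Special Plan "Taking on Challenging Projects by Responding to Calls for Solutions" Project (202201150401021); Shanxi Province Science and Technology Achievement Transformation Guidance Project (202104021301055); Shanxi Province Basic Research Program (202303021211153); Shanxi Province Basic Research Program (202303021212189); Shanxi Province Graduate Research Innovation Project (2024KY614)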

Multi-scale spatio-temporal decoupling for contrastive learning of skeleton action recognition

Xiaoxia LIU1,2,3, Liqun KUANG1,2,3, Song WANG1,2,3, Shichao JIAO1,2,3, Huiyan HAN1,2,3, Fengguang XIONG1,2,3

  1. School of Computer Science and Technology, North University of China, Taiyuan Shanxi 030051, China
    2. Shanxi Key Laboratory of Machine Vision and Virtual Reality (North University of China), Taiyuan Shanxi 030051, China
    3. Shanxi Vision Information Processing and Intelligent Robot Engineering Research Center, Taiyuan Shanxi 030051, China
  • Received: 2025-03-27 Revised: 2025-04-27 Accepted: 2025-04-28 Online: 2025-05-09 Published: 2026-03-10
  • Contact: Liqun KUANG
  • About the authors: LIU Xiaoxia, born in 2000, M. S. candidate. Her research interests include human behavior recognition.
    WANG Song, born in 1998, Ph. D. candidate. His research interests include image fusion, multimodal data fusion.
    JIAO Shichao, born in 1994, Ph. D., lecturer. His research interests include artificial intelligence, computer vision.
    HAN Huiyan, born in 1980, Ph. D., associate professor. Her research interests include artificial intelligence, computer vision.
    XIONG Fengguang, born in 1979, Ph. D., associate professor. His research interests include artificial intelligence, computer vision.
  • Supported by:
    Shanxi Province Science and Technology Special Plan “Taking on Challenging Projects by Responding to Calls for Solutions” Project (202201150401021); Shanxi Province Science and Technology Achievement Transformation Guidance Project (202104021301055); Shanxi Province Basic Research Program (202303021211153, 202303021212189); Shanxi Province Graduate Research Innovation Project (2024KY614)

Abstract:

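To address dynamic action modeling and multi-scale temporal fusion in skeleton-based action recognition, an efficient Multi-scale Spatio-Temporal Decoupled Contrastive Learning Framework (MSTDCLF) is proposed. First, a Multi-scale Spatio-Temporal Feature enhancement module (MSTF) is designed that combines depthwise separable convolution with dilated convolution, so that short-term motion features and long-term periodic behavior patterns are modeled at the same time. Second, a joint channel-spatial attention mechanism is embedded to further strengthen the semantic response between joints and feature channels. Third, a residual network with an attention mechanism is used to counter gradient decay in deep network structures. Finally, Bidirectional Gated Spatio-temporal Context Modeling (BGSCM) is proposed: a spatio-temporal enhancement branch is built on a Bidirectional Long Short-Term Memory (BiLSTM) network, and a gating mechanism passes the decoupled features in both directions along the joint topology and the temporal axis, which suppresses noise interference and establishes complete action-evolution dependencies. Experimental results show that MSTDCLF reaches accuracies of 87.5% (Cross-Subject (CS)) and 93.0% (Cross-View (CV)) on the NTU RGB+D 60 dataset, and 79.3% (CS) and 80.6% (Cross-Setup (SS)) on the NTU RGB+D 120 dataset, all higher than those of the second-best method, SCD-Net (Spatiotemporal Clues Disentanglement Network). Ablation results confirm the effectiveness of the multi-scale design and the bidirectional gating mechanism, indicating that MSTDCLF achieves efficient action representation in skeleton-based action recognition and effectively improves recognition accuracy.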

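Keywords: action recognition, human skeleton, gated spatio-temporal context, spatio-temporal feature extraction, contrastive learning, long short-term memory network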
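
The abstract describes MSTF as combining depthwise separable convolution with dilated convolution to cover short-term and long-term temporal patterns. The sketch below illustrates that general idea only, in plain Python on a `[T][C]` sequence (T frames, C channels). The function names, the dilation rates `(1, 2, 4)`, the weights shared across branches, and the averaging of branches are all assumptions made for illustration; the abstract does not specify the paper's actual layer configuration.

```python
from typing import List, Sequence

Frames = List[List[float]]  # shape [T][C]: T time steps, C feature channels


def depthwise_dilated_conv(x: Frames, kernels: List[List[float]], dilation: int) -> Frames:
    """Depthwise temporal convolution with zero 'same' padding.

    Each channel c is filtered only by its own kernel kernels[c] (odd length K);
    taps are spaced `dilation` frames apart, which widens the receptive field.
    """
    T, C = len(x), len(x[0])
    K = len(kernels[0])
    half = K // 2
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            acc = 0.0
            for k in range(K):
                src = t + (k - half) * dilation
                if 0 <= src < T:  # frames outside the sequence count as zero
                    acc += kernels[c][k] * x[src][c]
            out[t][c] = acc
    return out


def pointwise_conv(x: Frames, weight: List[List[float]]) -> Frames:
    """1x1 convolution: mixes channels frame by frame. weight has shape [C_out][C_in]."""
    return [[sum(w * v for w, v in zip(row, frame)) for row in weight] for frame in x]


def multi_scale_temporal(
    x: Frames,
    kernels: List[List[float]],
    weight: List[List[float]],
    dilations: Sequence[int] = (1, 2, 4),  # assumed rates, not taken from the paper
) -> Frames:
    """Average several depthwise-separable branches, one per dilation rate.

    Simplification: all branches share the same kernels and pointwise weights.
    """
    branches = [
        pointwise_conv(depthwise_dilated_conv(x, kernels, d), weight)
        for d in dilations
    ]
    T, C_out = len(x), len(weight)
    return [
        [sum(b[t][c] for b in branches) / len(branches) for c in range(C_out)]
        for t in range(T)
    ]
```

With a 3-tap kernel, the branch with dilation 1 sees 3 consecutive frames, while the branch with dilation 4 spans 9 frames using the same number of weights, which is why dilation is a cheap way to mix short and long temporal scales.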

Abstract:

To address the problems of dynamic action modeling and multi-scale temporal fusion in skeleton action recognition, an efficient Multi-scale Spatio-Temporal Decoupled Contrastive Learning Framework (MSTDCLF) was proposed. Firstly, a Multi-scale Spatio-Temporal Feature enhancement module (MSTF) was designed to combine depthwise separable convolution and dilated convolution, so as to model short-term motion features and long-term periodic behavior patterns simultaneously. Secondly, the semantic response between joints and feature channels was further strengthened by embedding a channel-spatial joint attention mechanism. Thirdly, a residual network with an attention mechanism was used to solve the gradient decay problem of deep network structures. Finally, Bidirectional Gated Spatio-temporal Context Modeling (BGSCM) was proposed: a spatio-temporal enhancement branch was constructed on the basis of a Bidirectional Long Short-Term Memory (BiLSTM) network, and the decoupled features were transmitted bidirectionally along the joint topology and the temporal axis through a gating mechanism, thereby suppressing noise interference and establishing complete action evolution dependency. Experimental results show that MSTDCLF achieves accuracies of 87.5% (Cross-Subject (CS)) and 93.0% (Cross-View (CV)) on the NTU RGB+D 60 dataset, and accuracies of 79.3% (CS) and 80.6% (Cross-Setup (SS)) on the NTU RGB+D 120 dataset, all of which are better than those of the second-best method, SCD-Net (Spatiotemporal Clues Disentanglement Network). Ablation experiments verify the effectiveness of the multi-scale design and the bidirectional gating mechanism, indicating that MSTDCLF can achieve efficient behavior representation in skeleton action recognition and improve recognition accuracy effectively.

Key words: action recognition, human skeleton, gated spatio-temporal context, spatio-temporal feature extraction, contrastive learning, Long Short-Term Memory (LSTM) network
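
The abstract states that BGSCM builds on a BiLSTM and uses a gating mechanism to pass features in both directions. As a rough illustration of that mechanism, and not the paper's implementation, the sketch below runs a small LSTM cell forward and backward over a feature sequence and blends the two hidden states with a learned sigmoid gate. It covers only the temporal axis (the abstract also mentions the joint topology), and the class `LSTMCell`, the function `gated_bilstm`, the weight initialization, and the convex-combination form of the gate are all assumptions.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(weight: Matrix, bias: Vector, vec: Vector) -> Vector:
    """Compute weight @ vec + bias with plain lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


class LSTMCell:
    """Minimal LSTM cell (hypothetical helper, randomly initialized)."""

    def __init__(self, input_size: int, hidden_size: int, rng: random.Random) -> None:
        self.hidden_size = hidden_size
        n_in = input_size + hidden_size
        # One weight matrix per gate: input (i), forget (f), output (o), candidate (g).
        self.w = {
            gate: [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden_size)]
            for gate in "ifog"
        }
        self.b = {gate: [0.0] * hidden_size for gate in "ifog"}

    def step(self, x: Vector, h: Vector, c: Vector):
        z = x + h  # concatenate the input with the previous hidden state
        i = [sigmoid(v) for v in affine(self.w["i"], self.b["i"], z)]
        f = [sigmoid(v) for v in affine(self.w["f"], self.b["f"], z)]
        o = [sigmoid(v) for v in affine(self.w["o"], self.b["o"], z)]
        g = [math.tanh(v) for v in affine(self.w["g"], self.b["g"], z)]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

    def run(self, seq: List[Vector]) -> List[Vector]:
        """Return the hidden state at every step, starting from zero states."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        outputs = []
        for x in seq:
            h, c = self.step(x, h, c)
            outputs.append(h)
        return outputs


def gated_bilstm(seq: List[Vector], fwd: LSTMCell, bwd: LSTMCell,
                 gate_w: Matrix, gate_b: Vector) -> List[Vector]:
    """Blend forward and backward hidden states with a per-unit sigmoid gate.

    gate_w has shape [H][2H]; the gate decides, per time step and hidden unit,
    how much past context (forward) versus future context (backward) to keep.
    """
    h_fwd = fwd.run(seq)
    h_bwd = bwd.run(seq[::-1])[::-1]  # run on the reversed sequence, then re-align
    fused = []
    for hf, hb in zip(h_fwd, h_bwd):
        gate = [sigmoid(v) for v in affine(gate_w, gate_b, hf + hb)]
        fused.append([g * a + (1.0 - g) * b for g, a, b in zip(gate, hf, hb)])
    return fused
```

Because the backward pass reaches the first frame last, the fused output at the first time step already depends on the final frames of the sequence, which is one plausible reading of the "complete action evolution dependency" the abstract refers to.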

CLC number: