Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (7): 1915-1921.DOI: 10.11772/j.issn.1001-9081.2020091515

Special Issue: Artificial Intelligence

• Artificial Intelligence •

Human skeleton-based action recognition algorithm based on spatiotemporal attention graph convolutional network model

LI Yangzhi1, YUAN Jiazheng2, LIU Hongzhe1   

  1. Beijing Key Laboratory of Information Service Engineering (Beijing Union University), Beijing 100101, China;
    2. Department of Scientific Research and Foreign Affairs, Beijing Open University, Beijing 100081, China
  • Received:2020-09-29 Revised:2020-12-17 Online:2021-07-10 Published:2021-01-26
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61871028, 61871039, 61906017, 61802019), the Leading Talent Program of Beijing Union University of China (BPHR2019AZ01), the Beijing Municipal Education Commission Project of China (KM202111417001, KM201911417001), and the Graduate Research and Innovation Funding Project of Beijing Union University (YZ2020K001).


  • Corresponding author: YUAN Jiazheng
  • About the authors: LI Yangzhi, born in 1996 in Shaodong, Hunan, is an M.S. candidate. His research interests include computer vision and artificial intelligence. YUAN Jiazheng, born in 1971 in Longhui, Hunan, is a professor with a Ph.D. His research interests include computer vision and artificial intelligence. LIU Hongzhe, born in 1971 in Baoding, Hebei, is a professor with a Ph.D. Her research interests include digital image processing and intelligent driving.

Abstract: Aiming at the problem that existing human skeleton-based action recognition algorithms cannot fully exploit the spatial and temporal characteristics of motion, a human skeleton-based action recognition algorithm based on a Spatiotemporal Attention Graph Convolutional Network (STA-GCN) model was proposed. The model consists of a spatial attention mechanism and a temporal attention mechanism. The spatial attention mechanism uses the instantaneous motion information of optical flow features to locate spatial regions with salient motion on the one hand, and introduces global average pooling and an auxiliary classification loss during training on the other hand, so that the model can also attend to discriminative non-motion regions. The temporal attention mechanism automatically extracts discriminative temporal segments from long, complex videos. Both mechanisms are integrated into a unified Graph Convolutional Network (GCN) framework, enabling end-to-end training. Experimental results on the Kinetics and NTU RGB+D datasets show that the proposed STA-GCN-based algorithm has strong robustness and stability. Compared with the baseline algorithm based on the Spatial Temporal Graph Convolutional Network (ST-GCN) model, the Top-1 and Top-5 accuracies on Kinetics are improved by 5.0 and 4.5 percentage points respectively, and the Top-1 accuracies on the Cross-Subject (CS) and Cross-View (CV) benchmarks of NTU RGB+D are improved by 6.2 and 6.7 percentage points respectively. The algorithm also outperforms current state-of-the-art action recognition methods such as Res-TCN (Residual Temporal Convolutional Network), STA-LSTM, and AS-GCN (Actional-Structural Graph Convolutional Network). These results indicate that the proposed algorithm can better meet the practical application requirements of human action recognition.
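The abstract does not include an implementation, but the core idea of combining a graph convolution over skeleton joints with spatial (per-joint) and temporal (per-frame) attention weights can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the authors' STA-GCN: `X` is a skeleton sequence of shape (frames T, joints V, channels C), `A` is a row-normalized joint adjacency matrix, and the attention scores stand in for the optical-flow-driven and learned scores described in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax used to normalize attention scores.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sta_gcn_layer(X, A, W, spatial_score, temporal_score):
    """Toy spatiotemporal-attention graph convolution layer (illustrative only).

    X: (T, V, C) skeleton sequence; A: (V, V) normalized adjacency;
    W: (C, C_out) projection; spatial_score: (V,); temporal_score: (T,).
    """
    # Spatial attention: reweight joints (in the paper, driven by
    # optical-flow motion saliency plus an auxiliary classification loss).
    s = softmax(spatial_score)                   # (V,)
    Xs = X * s[None, :, None]                    # (T, V, C)
    # Graph convolution: aggregate neighboring joints, then project.
    H = np.einsum('uv,tvc->tuc', A, Xs) @ W      # (T, V, C_out)
    # Temporal attention: weighted pooling over frames, emphasizing
    # discriminative temporal segments.
    t = softmax(temporal_score)                  # (T,)
    return np.einsum('t,tvc->vc', t, H)          # (V, C_out)

rng = np.random.default_rng(0)
T, V, C, C_out = 8, 5, 3, 4
X = rng.standard_normal((T, V, C))
A = np.eye(V) + np.roll(np.eye(V), 1, axis=0)    # toy chain-like skeleton graph
A /= A.sum(axis=1, keepdims=True)                # row-normalize adjacency
W = rng.standard_normal((C, C_out))
out = sta_gcn_layer(X, A, W, rng.standard_normal(V), rng.standard_normal(T))
print(out.shape)  # (5, 4)
```

In the actual model the attention scores are learned end-to-end inside the GCN framework rather than supplied externally, and the layer is stacked with temporal convolutions; the sketch only shows how the two attention mechanisms modulate a single graph convolution.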

Key words: Graph Convolutional Network (GCN), human skeleton-based action recognition, attention mechanism, human joint, video behavior understanding

