Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (7): 1915-1921.DOI: 10.11772/j.issn.1001-9081.2020091515

Special Issue: Artificial Intelligence

• Artificial Intelligence •

Human skeleton-based action recognition algorithm based on spatiotemporal attention graph convolutional network model

LI Yangzhi1, YUAN Jiazheng2, LIU Hongzhe1   

  1. Beijing Key Laboratory of Information Service Engineering (Beijing Union University), Beijing 100101, China;
    2. Department of Scientific Research and Foreign Affairs, Beijing Open University, Beijing 100081, China
  • Received:2020-09-29 Revised:2020-12-17 Online:2021-07-10 Published:2021-01-26
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61871028, 61871039, 61906017, 61802019), the Leading Talent Program of Beijing Union University of China (BPHR2019AZ01), the Beijing Municipal Education Commission Project of China (KM202111417001, KM201911417001), and the Graduate Research and Innovation Funding Project of Beijing Union University (YZ2020K001).


  • Corresponding author: YUAN Jiazheng
  • About the authors: LI Yangzhi, born in 1996 in Shaodong, Hunan, is an M.S. candidate. His research interests include computer vision and artificial intelligence. YUAN Jiazheng, born in 1971 in Longhui, Hunan, is a professor with a Ph.D. His research interests include computer vision and artificial intelligence. LIU Hongzhe, born in 1971 in Baoding, Hebei, is a professor with a Ph.D. Her research interests include digital image processing and intelligent driving.

Abstract: Aiming at the problem that existing human skeleton-based action recognition algorithms cannot fully exploit the spatial and temporal characteristics of motion, a human skeleton-based action recognition algorithm based on a Spatiotemporal Attention Graph Convolutional Network (STA-GCN) model was proposed. The model consists of a spatial attention mechanism and a temporal attention mechanism. The spatial attention mechanism uses the instantaneous motion information of optical flow features to locate spatial regions with salient motion on the one hand, and introduces global average pooling and an auxiliary classification loss during training on the other hand, so that the model can also attend to discriminative non-motion regions. The temporal attention mechanism automatically extracts discriminative temporal segments from long, complex videos. Both mechanisms are integrated into a unified Graph Convolutional Network (GCN) framework, enabling end-to-end training. Experimental results on the Kinetics and NTU RGB+D datasets show that the proposed STA-GCN-based algorithm has strong robustness and stability. Compared with the baseline algorithm based on the Spatial Temporal Graph Convolutional Network (ST-GCN) model, the Top-1 and Top-5 accuracies on Kinetics are improved by 5.0 and 4.5 percentage points respectively, and the Top-1 accuracies on the Cross-Subject (CS) and Cross-View (CV) benchmarks of NTU RGB+D are improved by 6.2 and 6.7 percentage points respectively. The algorithm also outperforms current state-of-the-art action recognition methods such as Res-TCN (Residual Temporal Convolutional Network), STA-LSTM, and AS-GCN (Actional-Structural Graph Convolutional Network). These results indicate that the proposed algorithm can better meet the practical application requirements of human action recognition.
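The abstract does not include an implementation, but the core idea of combining a graph convolution over skeleton joints with spatial (per-joint) and temporal (per-frame) attention weights can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, not the authors' STA-GCN: `X` is a skeleton sequence of shape (frames T, joints V, channels C), `A` is a row-normalized joint adjacency matrix, and the attention scores stand in for the optical-flow-driven and learned scores described in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax used to normalize attention scores.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sta_gcn_layer(X, A, W, spatial_score, temporal_score):
    """Toy spatiotemporal-attention graph convolution layer (illustrative only).

    X: (T, V, C) skeleton sequence; A: (V, V) normalized adjacency;
    W: (C, C_out) projection; spatial_score: (V,); temporal_score: (T,).
    """
    # Spatial attention: reweight joints (in the paper, driven by
    # optical-flow motion saliency plus an auxiliary classification loss).
    s = softmax(spatial_score)                   # (V,)
    Xs = X * s[None, :, None]                    # (T, V, C)
    # Graph convolution: aggregate neighboring joints, then project.
    H = np.einsum('uv,tvc->tuc', A, Xs) @ W      # (T, V, C_out)
    # Temporal attention: weighted pooling over frames, emphasizing
    # discriminative temporal segments.
    t = softmax(temporal_score)                  # (T,)
    return np.einsum('t,tvc->vc', t, H)          # (V, C_out)

rng = np.random.default_rng(0)
T, V, C, C_out = 8, 5, 3, 4
X = rng.standard_normal((T, V, C))
A = np.eye(V) + np.roll(np.eye(V), 1, axis=0)    # toy chain-like skeleton graph
A /= A.sum(axis=1, keepdims=True)                # row-normalize adjacency
W = rng.standard_normal((C, C_out))
out = sta_gcn_layer(X, A, W, rng.standard_normal(V), rng.standard_normal(T))
print(out.shape)  # (5, 4)
```

In the actual model the attention scores are learned end-to-end inside the GCN framework rather than supplied externally, and the layer is stacked with temporal convolutions; the sketch only shows how the two attention mechanisms modulate a single graph convolution.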

Key words: Graph Convolutional Network (GCN), human skeleton-based action recognition, attention mechanism, human joint, video behavior understanding

