Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (10): 3236-3243. DOI: 10.11772/j.issn.1001-9081.2022101473
Special Issue: Multimedia Computing and Computer Simulation
Suolan LIU1,2, Zhenzhen TIAN1, Hongyuan WANG1, Long LIN1, Yan WANG1
Received: 2022-10-11
Revised: 2022-12-29
Accepted: 2023-01-03
Online: 2023-04-12
Published: 2023-10-10
Contact: Hongyuan WANG
About author: LIU Suolan, born in 1980 in Taizhou, Jiangsu, Ph.D., associate professor, CCF member. Her research interests include computer vision and artificial intelligence.
Suolan LIU, Zhenzhen TIAN, Hongyuan WANG, Long LIN, Yan WANG. Human action recognition method based on multi-scale feature fusion of single mode[J]. Journal of Computer Applications, 2023, 43(10): 3236-3243.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2022101473
| Method | Accuracy/% | Parameters/10⁶ |
|---|---|---|
| RA-GCN(3s) | 87.3 | 6.21 |
| Shift-GCN(1s) | 87.8 | 0.72 |
| ST-TR(1s) | 88.7 | 6.48 |
| DGNN(2s) | 89.9 | 26.20 |
| PL-GCN | 89.2 | 20.70 |
| PB-GCN | 87.5 | 3.55 |
| Proposed method | 89.0 | 4.10 |
Tab. 1 Accuracy comparison of different methods on NTU RGB+D60 (X-sub protocol)
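The accuracy-versus-size trade-off in Tab. 1 is easier to judge when accuracy is normalized by parameter count. The short script below is not part of the paper; it is only an illustrative calculation over the figures reported in Tab. 1 (accuracy in %, parameters in millions).

```python
# Illustrative only: accuracy-vs-parameter trade-off computed from the Tab. 1 figures.
# Each entry: (method, top-1 accuracy in %, parameter count in millions).
results = [
    ("RA-GCN(3s)",      87.3,  6.21),
    ("Shift-GCN(1s)",   87.8,  0.72),
    ("ST-TR(1s)",       88.7,  6.48),
    ("DGNN(2s)",        89.9, 26.20),
    ("PL-GCN",          89.2, 20.70),
    ("PB-GCN",          87.5,  3.55),
    ("Proposed method", 89.0,  4.10),
]

# Sort by accuracy per million parameters (a crude efficiency proxy).
for name, acc, params in sorted(results, key=lambda r: r[1] / r[2], reverse=True):
    print(f"{name:16s} acc={acc:.1f}% params={params:.2f}M acc/params={acc / params:.1f}")
```

By this crude ratio Shift-GCN(1s) is by far the smallest model, while the proposed method reaches accuracy within 0.9 percentage points of DGNN(2s) with roughly one sixth of its parameters.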
| Features | Method | X-sub accuracy/% | X-view accuracy/% |
|---|---|---|---|
| Single feature | ST-GCN | 81.5 | 88.3 |
| | Global feature graph | 86.7 | 93.1 |
| | 3-subgraph | 86.8 | 93.3 |
| | 4-subgraph | 87.4 | 93.7 |
| | 5-subgraph | 86.9 | 93.4 |
| | 6-subgraph | 87.0 | 93.2 |
| Multi-feature fusion | Global feature graph + 3-subgraph | 88.8 | 94.2 |
| | Global feature graph + 4-subgraph | 89.0 | 94.2 |
| | Global feature graph + 5-subgraph | 88.2 | 94.1 |
| | Global feature graph + 6-subgraph | 88.7 | 93.6 |
Tab. 2 Results of ablation experiments on NTU RGB+D60 dataset
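The ablation in Tab. 2 shows that fusing the global feature graph with a k-subgraph feature outperforms either feature alone, with the 4-subgraph combination performing best. As a reading aid only, the sketch below shows one common way such a two-branch combination can be realized, namely score-level (late) fusion; the function names, the fusion weight alpha, and the toy inputs are assumptions for illustration and should not be taken as the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the class dimension.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_scores(global_logits, subgraph_logits, alpha=0.5):
    """Score-level fusion of two branches (hypothetical, for illustration).

    global_logits, subgraph_logits: arrays of shape (batch, num_classes).
    alpha: weight on the global-feature-graph branch.
    """
    return alpha * softmax(global_logits) + (1.0 - alpha) * softmax(subgraph_logits)

# Toy usage: 2 samples, 60 action classes (as in NTU RGB+D 60).
rng = np.random.default_rng(0)
g = rng.normal(size=(2, 60))   # stand-in for the global-feature-graph branch output
s = rng.normal(size=(2, 60))   # stand-in for the 4-subgraph branch output
print(fuse_scores(g, s).argmax(axis=1))   # fused class predictions
```

Score-level averaging is only one option; concatenating the two feature maps before the classifier is an equally common design for this kind of fusion.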
| Method | X-sub accuracy/% | X-view accuracy/% |
|---|---|---|
| ST-GCN | 81.5 | 88.3 |
| PB-GCN | 87.5 | 93.2 |
| SAN | 87.2 | 92.7 |
| SGN | 89.0 | 94.5 |
| PGCN-TCA | 88.0 | 93.6 |
| ST-TR(1s) | 88.7 | 95.6 |
| RA-GCN(3s) | 87.3 | 93.6 |
| MST-GCN(1s) | 89.0 | 95.1 |
| Shift-GCN(1s) | 87.8 | 95.1 |
| SkeleMixCLR(3s) | 87.7 | 94.0 |
| Proposed method | 89.0 | 94.2 |
Tab. 3 Recognition accuracies of different methods on NTU RGB+D60 dataset
| Method | X-sub accuracy/% | X-setup accuracy/% |
|---|---|---|
| GVFE+AS-GCN with DH-TCN | 78.3 | 79.8 |
| Gimme Signals | 70.8 | 71.6 |
| SkeleMixCLR(3s) | 82.0 | 82.9 |
| Shift-GCN(1s) | 80.9 | 83.2 |
| MST-GCN(1s) | 82.8 | 84.5 |
| RA-GCN(3s) | 81.1 | 82.7 |
| ST-TR(1s) | 81.9 | 84.1 |
| SGN | 79.2 | 81.5 |
| Proposed method | 83.3 | 85.0 |
Tab. 4 Recognition accuracies of different methods on NTU RGB+D120 dataset
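To read Tab. 4 at a glance, the snippet below (again, illustrative arithmetic only, not from the paper) prints the margin of the proposed method over each listed baseline under both NTU RGB+D 120 protocols.

```python
# Illustrative only: margins of the proposed method over each baseline,
# computed from the Tab. 4 figures (NTU RGB+D 120, accuracy in %).
baselines = {
    "GVFE+AS-GCN with DH-TCN": (78.3, 79.8),
    "Gimme Signals":           (70.8, 71.6),
    "SkeleMixCLR(3s)":         (82.0, 82.9),
    "Shift-GCN(1s)":           (80.9, 83.2),
    "MST-GCN(1s)":             (82.8, 84.5),
    "RA-GCN(3s)":              (81.1, 82.7),
    "ST-TR(1s)":               (81.9, 84.1),
    "SGN":                     (79.2, 81.5),
}
proposed = (83.3, 85.0)  # (X-sub, X-setup)

for name, (xsub, xsetup) in baselines.items():
    print(f"{name:26s} dX-sub={proposed[0] - xsub:+.1f} dX-setup={proposed[1] - xsetup:+.1f}")
```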
| 1 | SI C, CHEN W, WANG W, et al. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 1227-1236. 10.1109/cvpr.2019.00132 | 
| 2 | van den OORD A, KALCHBRENNER N, KAVUKCUOGLU K. Pixel recurrent neural networks[C]// Proceedings of the 33rd International Conference on Machine Learning. New York: JMLR.org, 2016: 1747-1756. | 
| 3 | DEFFERRARD M, BRESSON X, VANDERGHEYNST P. Convolutional neural networks on graphs with fast localized spectral filtering[C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2016: 3844-3852. | 
| 4 | YANG H, YAN D, ZHANG L, et al. Feedback graph convolutional network for skeleton-based action recognition[J]. IEEE Transactions on Image Processing, 2022, 31: 164-175. 10.1109/tip.2021.3129117 | 
| 5 | YAN S, XIONG Y, LIN D. Spatial temporal graph convolutional networks for skeleton-based action recognition[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2018: 7444-7452. 10.1609/aaai.v32i1.12328 | 
| 6 | SHI L, ZHANG Y, CHENG J, et al. Decoupled spatial-temporal attention network for skeleton-based action recognition[C]// Proceedings of the 2020 Asian Conference on Computer Vision, LNCS 12626. Cham: Springer, 2021: 38-53. | 
| 7 | CHEN Y, ZHANG Z, YUAN C, et al. Channel-wise topology refinement graph convolution for skeleton-based action recognition[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 13339-13348. 10.1109/iccv48922.2021.01311 | 
| 8 | LI C, CUI Z, ZHENG W, et al. Action-attending graphic neural network[J]. IEEE Transactions on Image Processing, 2018, 27(7): 3657-3670. 10.1109/tip.2018.2815744 | 
| 9 | PENG W, HONG X, CHEN H, et al. Learning graph convolutional network for skeleton-based human action recognition by neural searching[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2020: 2669-2676. 10.1609/aaai.v34i03.5652 | 
| 10 | ZHAO R, WANG K, SU H, et al. Bayesian graph convolution LSTM for skeleton based action recognition[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 6882-6892. 10.1109/iccv.2019.00698 | 
| 11 | GAO J, HE T, ZHOU X, et al. Focusing and diffusion: bidirectional attentive graph convolutional networks for skeleton-based action recognition[EB/OL]. (2019-12-24) [2022-08-13]. 10.1109/lsp.2021.3116513 | 
| 12 | KIPF T N, WELLING M. Semi-supervised classification with graph convolutional networks[EB/OL]. (2017-02-22) [2022-09-10]. 10.48550/arXiv.1609.02907 | 
| 13 | LIU Z, ZHANG H, CHEN Z, et al. Disentangling and unifying graph convolutions for skeleton-based action recognition[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 143-152. 10.1109/cvpr42600.2020.00022 | 
| 14 | CHENG K, ZHANG Y, HE X, et al. Skeleton-based action recognition with shift graph convolutional network[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 180-189. 10.1109/cvpr42600.2020.00026 | 
| 15 | SONG Y F, ZHANG Z, SHAN C, et al. Richly activated graph convolutional network for robust skeleton-based action recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(5): 1915-1925. 10.1109/tcsvt.2020.3015051 | 
| 16 | CHO S, MAQBOOL M H, LIU F, et al. Self-attention network for skeleton-based human action recognition[C]// Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision. Piscataway: IEEE, 2020: 624-633. 10.1109/wacv45572.2020.9093639 | 
| 17 | YU W, YANG K, YAO H, et al. Exploiting the complementary strengths of multi-layer CNN features for image retrieval[J]. Neurocomputing, 2017, 237: 235-241. 10.1016/j.neucom.2016.12.002 | 
| 18 | LIU W B, ZOU Z Y, XING W W. Feature fusion method in pattern classification[J]. Journal of Beijing University of Posts and Telecommunications, 2017, 40(4): 1-8. (in Chinese) | 
| 19 | SHI L, ZHANG Y, CHENG J, et al. Two-stream adaptive graph convolutional networks for skeleton-based action recognition[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 12018-12027. 10.1109/cvpr.2019.01230 | 
| 20 | CHEN Y, ROHRBACH M, YAN Z, et al. Graph-based global reasoning networks[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 433-442. 10.1109/cvpr.2019.00052 | 
| 21 | SHAHROUDY A, LIU J, NG T T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 1010-1019. 10.1109/cvpr.2016.115 | 
| 22 | LIU J, SHAHROUDY A, PEREZ M, et al. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(10): 2684-2701. 10.1109/tpami.2019.2916873 | 
| 23 | PASZKE A, GROSS S, CHINTALA S, et al. Automatic differentiation in PyTorch[EB/OL]. (2017-10-29) [2020-12-01]. | 
| 24 | HUANG L, HUANG Y, OUYANG W, et al. Part-level graph convolutional network for skeleton-based action recognition[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2020: 11045-11052. 10.1609/aaai.v34i07.6759 | 
| 25 | SHI L, ZHANG Y, CHENG J, et al. Skeleton-based action recognition with directed graph neural networks[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 7904-7913. 10.1109/cvpr.2019.00810 | 
| 26 | THAKKAR K, NARAYANAN P J. Part-based graph convolutional network for action recognition[EB/OL]. (2018-09-13) [2022-08-13]. | 
| 27 | YANG H, GU Y, ZHU J, et al. PGCN-TCA: pseudo graph convolutional network with temporal and channel-wise attention for skeleton-based action recognition[J]. IEEE Access, 2020, 8: 10040-10047. 10.1109/access.2020.2964115 | 
| 28 | PLIZZARI C, CANNICI M, MATTEUCCI M. Skeleton-based action recognition via spatial and temporal transformer networks[J]. Computer Vision and Image Understanding, 2021, 208/209: No.103219. 10.1016/j.cviu.2021.103219 | 
| 29 | ZHANG P, LAN C, ZENG W, et al. Semantics-guided neural networks for efficient skeleton-based human action recognition[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 1109-1118. 10.1109/cvpr42600.2020.00119 | 
| 30 | CHEN Z, LIU H, GUO T, et al. Contrastive learning from spatio-temporal mixed skeleton sequences for self-supervised skeleton-based action recognition[EB/OL]. (2022-07-07) [2022-10-23]. | 
| 31 | CHEN Z, LI S, YANG B, et al. Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition[C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2021: 1113-1122. 10.1609/aaai.v35i2.16197 | 
| 32 | PAPADOPOULOS K, GHORBEL E, AOUADA D, et al. Vertex feature encoding and hierarchical temporal modeling in a spatial-temporal graph convolutional network for action recognition[C]// Proceedings of the 25th International Conference on Pattern Recognition. Piscataway: IEEE, 2021: 452-458. 10.1109/icpr48806.2021.9413189 | 
| 33 | MEMMESHEIMER R, THEISEN N, PAULUS D. Gimme signals: discriminative signal encoding for multimodal activity recognition[C]// Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems. Piscataway: IEEE, 2020: 10394-10401. 10.1109/iros45743.2020.9341699 | 