Lightweight fall detection algorithm framework based on RPEpose and XJ-GCN

doi:10.11772/j.issn.1001-9081.2023101379

Abstract


Joint keypoint detection models built on the Vision Transformer (ViT) usually adopt 2D sine position embedding, which tends to lose key two-dimensional shape information in the image and thus reduces accuracy. Among behavior classification models, the traditional Spatio-Temporal Graph Convolutional Network (ST-GCN) fails to capture correlations between joint connections that are not physically connected under its uni-labeling partitioning strategy. To address these problems, a lightweight real-time fall detection algorithm framework was designed to detect fall behavior quickly and accurately. The framework consists of a joint keypoint detection model, RPEpose (Relative Position Encoding pose estimation), and a behavior classification model, XJ-GCN (Cross-Joint attention Graph Convolutional Network). On the one hand, RPEpose adopts a relative position encoding to overcome the position insensitivity of the original position encoding and to improve the performance of the ViT architecture in joint keypoint detection. On the other hand, an X-Joint (Cross-Joint) attention mechanism was proposed: after the partitioning strategy was reconstructed into the XJL (X-Joint Labeling) partitioning strategy, the dependencies between all joint connections were modeled to capture their potential correlations, giving excellent classification performance with few parameters. Experimental results show that, on the COCO 2017 validation set, RPEpose requires only 8.2 GFLOPs (giga floating-point operations) of computation while achieving an Average Precision (AP) of 74.3% on images with a resolution of 256×192; on the NTU RGB+D dataset with the Cross-Subject (X-Sub) split, XJ-GCN achieves a Top-1 accuracy of 89.6%; and the complete framework RPEpose+XJ-GCN reaches a prediction accuracy of 87.2% at a processing speed of 30 frame/s, verifying its real-time performance and accuracy.

Key words: fall detection, joint keypoint detection, relative position encoding, Spatio-Temporal Graph Convolutional Network (ST-GCN), attention mechanism
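
The abstract contrasts absolute 2D sine position embedding with relative position encoding. As a rough illustration of the general idea only, the following NumPy sketch adds a bias that depends on the 2D offset between two image tokens to the attention logits. It is a generic bias-style formulation, not the authors' RPEpose encoding; the grid size, the random `bias_table`, and all function names are assumptions made for this example.

```python
import numpy as np


def relative_position_index(height, width):
    """Map each (query, key) token pair on a height x width grid to the index of its 2D offset."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)   # (N, 2) row/column of each token
    rel = coords[:, None, :] - coords[None, :, :]         # (N, N, 2) query minus key offsets
    rel[..., 0] += height - 1                             # shift row offsets to 0 .. 2H-2
    rel[..., 1] += width - 1                              # shift column offsets to 0 .. 2W-2
    return rel[..., 0] * (2 * width - 1) + rel[..., 1]    # (N, N) flat offset index


def attention_with_relative_bias(q, k, v, bias_table, rel_index):
    """Single-head scaled dot-product attention with an additive relative position bias."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias_table[rel_index]  # bias depends only on the offset
    logits -= logits.max(axis=-1, keepdims=True)           # numerical stability for softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Hypothetical toy setup: a 4 x 3 grid of patch tokens with 8-dimensional features.
rng = np.random.default_rng(0)
H, W, D = 4, 3, 8
N = H * W
rel_index = relative_position_index(H, W)
# One bias per possible offset; in a trained model this table would be learned.
bias_table = rng.normal(scale=0.02, size=(2 * H - 1) * (2 * W - 1))
q, k, v = (rng.normal(size=(N, D)) for _ in range(3))
out, weights = attention_with_relative_bias(q, k, v, bias_table, rel_index)
```

Because the bias is looked up by offset rather than by absolute position, any two token pairs separated by the same displacement share the same bias entry, which is the property that distinguishes this family of encodings from absolute embeddings.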

CLC Number:

TP391.4

Ruiyan LIANG, Hui YANG. Lightweight fall detection algorithm framework based on RPEpose and XJ-GCN[J]. Journal of Computer Applications, 2024, 44(11): 3639-3646.

References

1	PIERLEONI P, BELLI A, PALMA L, et al. A high reliability wearable device for elderly fall detection[J]. IEEE Sensors Journal, 2015, 15(8): 4544-4553.
2	CAO Z, HIDALGO G, SIMON T, et al. OpenPose: realtime multi-person 2D pose estimation using part affinity fields[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(1): 172-186.
3	MAJI D, NAGORI S, MATHEW M, et al. YOLO-Pose: enhancing YOLO for multi person pose estimation using object keypoint similarity loss[C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 2636-2645.
4	CHEN Y, WANG Z, PENG Y, et al. Cascaded pyramid network for multi-person pose estimation[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7103-7112.
5	YANG S, QUAN Z, NIE M, et al. TransPose: keypoint localization via Transformer[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 11782-11792.
6	RAMACHANDRAN P, PARMAR N, VASWANI A, et al. Stand-alone self-attention in vision models[C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2019: 68-80.
7	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[EB/OL]. [2023-10-11].
8	LIN T-Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]// Proceedings of the 13th European Conference on Computer Vision. Cham: Springer, 2014: 740-755.
9	YAN S, XIONG Y, LIN D. Spatial temporal graph convolutional networks for skeleton-based action recognition[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2018: 7444-7452.
10	LI M, CHEN S, CHEN X, et al. Actional-structural graph convolutional networks for skeleton-based action recognition[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 3590-3598.
11	HEDEGAARD L, HEIDARI N, IOSIFIDIS A. Continual spatio-temporal graph convolutional networks[J]. Pattern Recognition, 2023, 140: 109528.
12	SHAHROUDY A, LIU J, NG T-T, et al. NTU RGB+D: a large scale dataset for 3D human activity analysis[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 1010-1019.
13	XU Y, ZHANG J, ZHANG Q, et al. ViTPose: simple vision Transformer baselines for human pose estimation[C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2022: 38571-38584.
14	YUAN Y, FU R, HUANG L, et al. HRFormer: high-resolution vision Transformer for dense predict[C]// Proceedings of the 35th International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2021: 7281-7293.
15	CAO J R, LYU J J, WU X Y, et al. Fall detection algorithm integrating motion features and deep learning[J]. Journal of Computer Applications, 2021, 41(2): 583-589. (in Chinese)
16	MA J Q, LEI H, CHEN M Y. Fall behavior detection algorithm for the elderly based on AlphaPose optimization model[J]. Journal of Computer Applications, 2022, 42(1): 294-301. (in Chinese)
17	DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]// Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2009: 248-255.
18	WU K, PENG H, CHEN M, et al. Rethinking and improving relative position encoding for vision Transformer[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 10033-10041.
19	FANG H-S, XIE S, TAI Y-W, et al. RMPE: regional multi-person pose estimation[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 2353-2362.
20	XIAO B, WU H, WEI Y. Simple baselines for human pose estimation and tracking[C]// Proceedings of the 15th European Conference on Computer Vision. Cham: Springer, 2018: 472-487.
21	KREISS S, BERTONI L, ALAHI A. OpenPifPaf: composite fields for semantic keypoint detection and spatio-temporal association[J]. IEEE Transactions on Intelligent Transportation Systems, 2022, 23(8): 13498-13511.
22	PLIZZARI C, CANNICI M, MATTEUCCI M. Skeleton-based action recognition via spatial and temporal Transformer networks[J]. Computer Vision and Image Understanding, 2021, 208/209: 103219.
23	LI C, ZHONG Q, XIE D, et al. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation[EB/OL]. [2023-08-22].
24	SHI L, ZHANG Y, CHENG J, et al. Two-stream adaptive graph convolutional networks for skeleton-based action recognition[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 12018-12027.
25	SI C, JING Y, WANG W, et al. Skeleton-based action recognition with spatial reasoning and temporal stack learning[C]// Proceedings of the 15th European Conference on Computer Vision. Cham: Springer, 2018: 106-121.
26	SI C, CHEN W, WANG W, et al. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 1227-1236.
27	ZHANG P, LAN C, XING J, et al. View adaptive neural networks for high performance skeleton-based human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1963-1978.

Position encoding	AP/%	AR/%
2D Sine Position Embedding	71.7	77.1
Bias Mode	72.9	77.4
Contextual Mode	73.3	77.6
RPE-I (ours)	74.3	78.2

Model	Resolution	Computation/GFLOPs	AP/%	AR/%
TransPose-H-A4 [5]	256×192	10.2	74.2	78.0
CPN+ [4]	384×288	29.2	73.0	79.0
AlphaPose [19]	320×256	26.7	72.3	-
Simple Baseline [20]	384×288	35.6	72.3	79.0
OpenPose [2]	-	-	65.3	-
YOLO-Pose [3]	960×960	-	68.5	75.0
OpenPifPaf [21]	-	-	71.9	-
RPEpose	256×192	8.2	74.3	78.2
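
To make the efficiency comparison in the table above concrete, the short sketch below computes how much less computation RPEpose uses than each baseline that reports a GFLOPs figure, together with the AP difference. The numbers are copied from the table; the variable names are hypothetical, and because several baselines run at higher input resolutions the comparison is indicative rather than like-for-like.

```python
# (model, GFLOPs, AP in %) copied from the comparison table.
# Models without a reported GFLOPs value are omitted.
rows = [
    ("TransPose-H-A4", 10.2, 74.2),
    ("CPN+", 29.2, 73.0),
    ("AlphaPose", 26.7, 72.3),
    ("Simple Baseline", 35.6, 72.3),
    ("RPEpose", 8.2, 74.3),
]

ref_name, ref_gflops, ref_ap = rows[-1]

savings = {}   # percentage of computation saved relative to each baseline
ap_gain = {}   # AP difference in percentage points (RPEpose minus baseline)
for name, gflops, ap in rows[:-1]:
    savings[name] = 100.0 * (gflops - ref_gflops) / gflops
    ap_gain[name] = ref_ap - ap
    print(f"{name}: {ref_name} uses {savings[name]:.1f}% fewer GFLOPs, "
          f"AP difference {ap_gain[name]:+.1f} points")
```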

Dimension	Top-1 accuracy/% (X-Sub)	Top-1 accuracy/% (X-View)
2D	88.4	95.2
3D	89.6	94.6