Journal of Computer Applications


Spatio-temporal context network for 3D human pose estimation based on graph attention

  

  • Received: 2024-10-22  Revised: 2025-01-11  Accepted: 2025-01-16  Online: 2025-02-07  Published: 2025-02-07

ZENG Zhengdong, ZHAO Ming

  1. College of Information Engineering, Shanghai Maritime University, Shanghai 200135, China
  • Corresponding author: ZHAO Ming
  • Supported by:
    National Natural Science Foundation of China; Natural Science Foundation of Shanghai

Abstract: Recent research on human pose estimation shows that fully exploiting 2D pose sequences to extract representative features for 3D pose estimation remains an open problem. Therefore, a spatio-temporal context network based on graph attention was proposed, comprising a Temporal Context Network with Shifted windows (STCN), an Extremity-Guided graph global Attention Network (EGAT), and a Pose Grammar-based graph Convolution Network (PGCN). Firstly, STCN was used to transform a long sequence of 2D joint positions into a single 3D human pose, which not only aggregated and exploited long-range and short-range pose information effectively, but also reduced the computational cost substantially. In 3D pose estimation, spatial features are essential for resolving depth ambiguity and occlusion, yet no existing method simultaneously captures changing spatial articulations flexibly and performs real-time 3D pose estimation efficiently. Therefore, the EGAT module was proposed to compute global spatial context efficiently. This module treated the extremity joints of the human body as "traffic hubs" and built bridges for information exchange between them and the other joints; it used graph convolution for information propagation and a graph attention mechanism for adaptive weight assignment when computing global context over the body joints. Finally, the PGCN module was designed to compute and model local spatial context with a Graph Convolution Network (GCN), emphasizing the motion consistency of symmetric joints and the kinematic structure of the human skeleton. In addition, extra temporal smoothness constraints were imposed under single-target-frame supervision, which helped generate more precise and smoother 3D poses. The proposed model was evaluated comprehensively on two challenging benchmark datasets, Human3.6M and HumanEva-I.
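As an illustrative sketch only (not the authors' implementation), the EGAT idea of treating extremity joints as "traffic hubs" that exchange information with all other joints, with attention-based adaptive weights, could look like the following NumPy fragment. The joint count, hub indices, and feature projection here are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hub_graph_attention(feats, adj, w):
    """One graph-attention step: pairwise scores from projected joint
    features, masked by the hub-augmented adjacency, then softmax and
    a weighted aggregation over neighbours."""
    h = feats @ w                              # project joint features
    scores = h @ h.T                           # pairwise attention logits
    scores = np.where(adj > 0, scores, -1e9)   # keep only graph edges
    attn = softmax(scores, axis=-1)            # adaptive per-joint weights
    return attn @ h                            # aggregate neighbour info

rng = np.random.default_rng(0)
n_joints, dim = 17, 8
feats = rng.standard_normal((n_joints, dim))

# Identity adjacency plus extra edges linking extremity "hub" joints
# (hypothetical indices: wrists 4/7, ankles 10/13, head 16) to all joints,
# so global context can flow through the hubs.
adj = np.eye(n_joints)
for hub in (4, 7, 10, 13, 16):
    adj[hub, :] = adj[:, hub] = 1

out = hub_graph_attention(feats, adj, rng.standard_normal((dim, dim)))
print(out.shape)  # (17, 8)
```

Restricting attention to hub-augmented edges keeps the computation sparse while still letting every joint reach every other joint within two hops, which matches the paper's stated goal of efficient global context.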
The results demonstrate that the approach achieves superior performance: with 81 input frames, it obtains a Mean Per Joint Position Error (MPJPE) of 43.5 mm on the Human3.6M dataset, a 9% reduction in prediction error compared with the state-of-the-art SCNet (Spatial Collaboration Network). Code is available at https://github.com/zzddwyff/upgraded-octo-carnival.
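For reference, MPJPE, the metric reported above, is the mean Euclidean distance between predicted and ground-truth 3D joint positions. A minimal sketch with toy data (the shapes and values are illustrative only):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average Euclidean distance between
    predicted and ground-truth 3D joints, in the input units (here mm)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Toy example: 2 frames x 17 joints x 3 coordinates (millimetres).
gt = np.zeros((2, 17, 3))
pred = gt + np.array([3.0, 0.0, 4.0])  # every joint off by a 3-4-5 offset
print(mpjpe(pred, gt))  # 5.0
```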

Key words: 3D human pose estimation, graph attention, temporal context, spatial context, temporal convolution network

