Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (10): 3161-3169. DOI: 10.11772/j.issn.1001-9081.2024101489

• Artificial intelligence •

基于图注意力机制的三维人体姿态估计时空上下文网络

曾正东, 赵明

  1. 上海海事大学 信息工程学院,上海 201306
  • 收稿日期:2024-10-30 修回日期:2025-01-11 接受日期:2025-01-16 发布日期:2025-02-07 出版日期:2025-10-10
  • 通讯作者: 赵明
  • 作者简介:曾正东(1999—),男,广东中山人,硕士研究生,CCF会员,主要研究方向:人体姿态估计、计算机视觉、目标检测
    赵明(1984—),女,湖北孝感人,教授,博士,主要研究方向:遥感影像采集与处理、计算机视觉、模式识别。 Email:zm_cynthia@163.com
  • 基金资助:
    国家自然科学基金资助项目(62271302)

Spatio-temporal context network for 3D human pose estimation based on graph attention

Zhengdong ZENG, Ming ZHAO

  1. College of Information Engineering,Shanghai Maritime University,Shanghai 201306,China
  • Received:2024-10-30 Revised:2025-01-11 Accepted:2025-01-16 Online:2025-02-07 Published:2025-10-10
  • Contact: Ming ZHAO
  • About author:ZENG Zhengdong, born in 1999, M. S. candidate. His research interests include human pose estimation, computer vision, object detection.
    ZHAO Ming, born in 1984, Ph. D., professor. Her research interests include remote sensing image acquisition and processing, computer vision, pattern recognition.
  • Supported by:
    National Natural Science Foundation of China(62271302)

摘要:

近期关于人体姿态估计的研究表明,充分发挥二维姿态潜在空间信息的能力,获取具有代表性的特征,可产生更准确的三维姿态估计结果。因此,提出一种基于图注意力机制的时空上下文网络,该网络包括带滑动窗口的时间上下文网络(STCN)、由肢体引导的全局图注意力机制网络(EGAT)和基于姿态语法的局部图注意力卷积网络(PGCN)。首先,使用STCN将长序列的二维关节位置转化为单序列的人体姿态潜在特征,从而有效聚合和利用远、近距离的人体姿态信息,并大幅降低计算成本。其次,提出EGAT模块,以有效计算全局空间上下文。该模块将人体边缘节点视为“交通枢纽”,为它们与其他节点之间的信息交换建立桥梁。再次,利用图注意力机制进行自适应权值分配,对人体关节进行全局上下文计算。最后,设计PGCN模块,利用图卷积网络(GCN)计算和建模局部空间上下文,它强调人体对称节点的运动一致性和人体骨骼的运动关联结构。在Human3.6M和HumanEva-Ⅰ这2个复杂的标准数据集上评估所提模型。实验结果表明,所提模型具有更优越的性能,在输入帧长度为81的情况下,所提模型在数据集Human3.6M上的每个关节的平均位置误差(MPJPE)达43.5 mm,与目前先进算法MCFNet(Multi-scale Cross Fusion Network)相比降低了10.5%,体现出更高的准确度。

关键词: 三维人体姿态估计, 图注意力, 时间上下文, 空间上下文, 时间卷积网络

Abstract:

Recent research on human pose estimation shows that fully exploiting the latent spatial information in 2D poses to obtain representative features yields more accurate 3D pose estimates. Therefore, a spatio-temporal context network based on the graph attention mechanism was proposed, comprising a Temporal Context Network with Shifted windows (STCN), an Extremity-Guided global graph ATtention mechanism network (EGAT), and a Pose Grammar-based local graph attention Convolution Network (PGCN). Firstly, STCN was used to transform a long sequence of 2D joint positions into the latent human pose features of a single sequence, which aggregated and utilized long-range and short-range human pose information effectively and reduced the computational cost significantly. Secondly, EGAT was presented to compute the global spatial context effectively: human extremities were treated as “traffic hubs”, and bridges were established for information exchange between them and other nodes. Thirdly, the graph attention mechanism was utilized for adaptive weight assignment to perform global context computation on human joints. Finally, PGCN was designed to utilize a Graph Convolution Network (GCN) for computing and modeling the local spatial context, emphasizing the motion consistency of symmetrical human joints and the motion correlation structure of human bones. The proposed model was evaluated on two challenging benchmark datasets, Human3.6M and HumanEva-Ⅰ. Experimental results demonstrate that, with an input length of 81 frames, the proposed model achieves a Mean Per Joint Position Error (MPJPE) of 43.5 mm on the Human3.6M dataset, a 10.5% reduction compared with the state-of-the-art algorithm MCFNet (Multi-scale Cross Fusion Network), indicating higher accuracy.

Key words: 3D human pose estimation, graph attention, temporal context, spatial context, Temporal Convolution Network (TCN)
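The MPJPE metric reported in the abstract is the Euclidean distance between predicted and ground-truth 3D joint positions, averaged over all joints and frames. The following is a minimal sketch of that computation and of a relative-reduction percentage, not the authors' evaluation code. The function names `mpjpe` and `relative_reduction` and the nested-list input layout are assumptions made for illustration, and the root-joint alignment that is typically applied before evaluation is omitted.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error over all frames and joints.

    pred, target: sequences of frames; each frame is a sequence of
    (x, y, z) joint positions in millimetres (layout is an assumption).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        if len(p_frame) != len(t_frame):
            raise ValueError("each frame must have the same number of joints")
        for p, t in zip(p_frame, t_frame):
            total += math.dist(p, t)  # Euclidean distance for one joint
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Toy example with made-up coordinates: one frame, two joints.
pred = [[(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]]
target = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
error_mm = mpjpe(pred, target)
```

In this toy input the two joints are off by 0 mm and 5 mm, so the mean error is 2.5 mm. The values are illustrative only and are unrelated to the results reported above.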

CLC number: