Journal of Computer Applications, 2026, Vol. 46, Issue 6: 1965-1972. DOI: 10.11772/j.issn.1001-9081.2025060763

• Multimedia computing and computer simulation •

3D human pose estimation model based on temporal-spatial feature pyramid network and multi-hypothesis interaction mechanism

Jinxiao ZHANG1, Chenglong LI1, Xinyan GAO2, Ming ZHANG1

  1. School of Computer and Artificial Intelligence, Shandong Jianzhu University, Jinan, Shandong 250101, China
  2. Shandong Huayun 3D Technology Company Limited, Jinan, Shandong 250000, China
  • Received: 2025-07-12  Revised: 2025-08-11  Accepted: 2025-08-15  Online: 2025-08-27  Published: 2026-06-10
  • Contact: Chenglong LI
  • About the authors: ZHANG Jinxiao, born in 1999, M. S. candidate. His research interests include computer vision and human pose estimation.
    GAO Xinyan, born in 1987, M. S., senior engineer. Her research interests include computer vision.
    ZHANG Ming, born in 1995, M. S. candidate. Her research interests include computer vision and human pose estimation.
    First contact: LI Chenglong, born in 1988, Ph. D., associate professor. His research interests include computer vision, augmented reality, and computer graphics.
  • Supported by:
    National Natural Science Foundation of China (62102235)

3D human pose estimation model based on temporal-spatial feature pyramid network and multi-hypothesis interaction mechanism

张金萧1, 李成龙1, 高新燕2, 张铭1

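  1. School of Computer and Artificial Intelligence, Shandong Jianzhu University, Jinan 250101
  2. Shandong Huayun 3D Technology Company Limited, Jinan 250000
  • Corresponding author: LI Chenglong (李成龙)
  • About the authors: ZHANG Jinxiao (张金萧, born 1999), male, from Zhumadian, Henan, M. S. candidate. His research interests include computer vision and human pose estimation.
    GAO Xinyan (高新燕, born 1987), female, from Jinan, Shandong, senior engineer, M. S. Her research interests include computer vision.
    ZHANG Ming (张铭, born 1995), female, from Heze, Shandong, M. S. candidate. Her research interests include computer vision and human pose estimation.
    First contact: LI Chenglong (李成龙, born 1988), male, from Jinan, Shandong, associate professor, Ph. D., CCF member. His research interests include computer vision, augmented reality, and computer graphics.
  • Supported by:
    National Natural Science Foundation of China (62102235)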

Abstract:

Accurately estimating ambiguous Three-Dimensional (3D) human poses from monocular videos remains a research challenge. Although existing methods can estimate 3D joint coordinates with deep learning models, most of them do not adequately consider the multi-solution nature of this inverse problem. Some multi-hypothesis estimation methods address the multi-solution problem, but they suffer from insufficient cross-level feature fusion. To address these issues, a 3D human pose estimation model based on a Temporal-SPatial Feature Pyramid Network (TSP-FPN) and a multi-hypothesis interaction mechanism, called TSP-FPN-MHFormer (Temporal-SPatial Feature Pyramid Network-Multi-Hypothesis Transformer), was proposed. Firstly, based on a Transformer encoder, the multi-possibility distribution of human poses was captured with a multi-head self-attention mechanism, generating multiple initial hypothesis features. Then, a TSP-FPN was designed, and a gated adaptive fusion strategy was employed to achieve dynamic weighted integration of multi-level skeleton sequence features, thereby effectively balancing the fusion of local details and global temporal information. Finally, based on the Multi-Hypothesis Transformer (MHFormer), a multi-hypothesis optimization module combining joint Relative Position Bias (RPB) with a cross-attention mechanism was implemented to facilitate cross-hypothesis communication and feature aggregation, enhancing the model's long-range reasoning capability over human topology for high-precision 3D joint coordinate estimation. Experimental results on the Human3.6M dataset demonstrate that the proposed model achieves a Mean Per Joint Position Error (MPJPE) of 42.3 mm and reduces the estimation error by 1.6% compared with the state-of-the-art method MHFormer, indicating substantial progress in addressing the multi-solution challenge of monocular 3D pose estimation.
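For readers unfamiliar with the metric reported above, MPJPE is the mean Euclidean distance between predicted and ground-truth joint positions, averaged over all joints and frames. The following is a minimal sketch of that computation and of the relative error reduction, not the authors' evaluation code. The function names `mpjpe` and `relative_reduction` and the toy poses are hypothetical, and the 43.0 mm baseline is an assumed value chosen only because it is consistent with the stated 1.6% reduction; the abstract itself does not give the baseline figure.

```python
import math


def mpjpe(pred, gt):
    """Mean Per Joint Position Error.

    pred, gt: lists of frames; each frame is a list of (x, y, z) joint
    coordinates in millimetres. Both inputs must have the same layout.
    """
    total = 0.0
    count = 0
    for pred_frame, gt_frame in zip(pred, gt):
        for p, g in zip(pred_frame, gt_frame):
            total += math.dist(p, g)  # Euclidean distance for one joint
            count += 1
    return total / count


def relative_reduction(baseline_mm, new_mm):
    """Percentage by which new_mm improves on baseline_mm."""
    return (baseline_mm - new_mm) / baseline_mm * 100.0


# Toy example (hypothetical data): one frame, two joints.
gt = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
pred = [[(3.0, 4.0, 0.0), (0.0, 0.0, 0.0)]]
toy_error = mpjpe(pred, gt)  # joint errors are 5 mm and 0 mm

# Assumed baseline of 43.0 mm against the reported 42.3 mm.
reduction = relative_reduction(43.0, 42.3)
```

Note that the 1.6% figure is a relative reduction in error, which corresponds to well under one millimetre in absolute terms.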

Key words: Three-Dimensional (3D) human pose estimation, dynamic fusion, multi-hypothesis interaction mechanism, attention mechanism, Relative Position Bias (RPB)
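The "dynamic fusion" keyword refers to the gated adaptive fusion strategy named in the abstract, whose exact formulation is not given here. As a rough illustration only, one common form of gated fusion computes a sigmoid gate per channel and uses it to blend a local-detail feature with a global-temporal feature as a convex combination. The function `gated_fusion`, its parameters, and the per-channel scalar weights below are assumptions made for this sketch, not the TSP-FPN design.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(local_feat, global_feat, w_local, w_global, bias):
    """Blend two equal-length feature vectors with a per-channel gate.

    For each channel, gate = sigmoid(w_local * a + w_global * b + bias),
    and the fused value is gate * a + (1 - gate) * b, so the output always
    lies between the two inputs. The weights and bias would be learned in
    a real model; here they are plain numbers (an assumption).
    """
    fused = []
    for a, b, wa, wb, c in zip(local_feat, global_feat, w_local, w_global, bias):
        gate = sigmoid(wa * a + wb * b + c)
        fused.append(gate * a + (1.0 - gate) * b)
    return fused


# With zero weights and bias the gate is 0.5, giving a plain average.
averaged = gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])

# A large positive bias pushes the gate toward 1, favouring the local feature.
mostly_local = gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0], [50.0, 50.0])
```

Because the gate depends on the inputs, the blend weight changes from sample to sample, which is what distinguishes this kind of fusion from a fixed weighted sum.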

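Abstract:

Accurately predicting ambiguous three-dimensional (3D) human poses from monocular video is a current research difficulty. Although existing methods can predict 3D joint coordinates with deep learning models, most of them do not adequately account for the multi-solution nature of this inverse problem. Some multi-hypothesis prediction methods can handle the multi-solution problem, but they suffer from insufficient cross-level feature fusion. To address these problems, a 3D human pose estimation model based on a temporal-spatial feature pyramid network (TSP-FPN) and a multi-hypothesis interaction mechanism, named TSP-FPN-MHFormer (TSP-FPN-Multi-Hypothesis Transformer), is proposed. First, based on a Transformer encoder, a multi-head self-attention mechanism is used to capture the distribution of multiple possible human poses and generate several initial hypothesis features. Second, the TSP-FPN is designed, and a gated adaptive fusion strategy is adopted to dynamically weight and integrate the multi-level features of the skeleton sequence, effectively balancing the fusion of local details and global temporal information. Finally, building on the Multi-Hypothesis Transformer (MHFormer), a multi-hypothesis optimization module that combines joint relative position bias (RPB) with a cross-attention mechanism is implemented to promote communication and feature aggregation among the hypotheses, strengthening the model's long-range reasoning over human topology and enabling high-precision prediction of 3D joint coordinates. Experimental results on the Human3.6M dataset show that the proposed model reaches a mean per joint position error (MPJPE) of 42.3 mm, a 1.6% reduction in prediction error compared with the current advanced method MHFormer, which indicates substantial progress on the multi-solution challenge of monocular 3D pose estimation.

Key words: three-dimensional human pose estimation, dynamic fusion, multi-hypothesis interaction mechanism, attention mechanism, relative position bias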

CLC Number: