Journal of Computer Applications


3D human pose estimation model based on temporal-spatial feature pyramid network and multi-hypothesis interaction mechanism

  

  • Received: 2025-07-14; Revised: 2025-08-11; Online: 2025-08-27; Published: 2025-08-27


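ZHANG Jinxiao (张金萧), LI Chenglong (李成龙), GAO Xinyan (高新燕), ZHANG Ming (张铭)

  1. School of Computer and Artificial Intelligence, Shandong Jianzhu University (山东建筑大学)
  • Corresponding author: LI Chenglong
  • Supported by:
    National Natural Science Foundation of China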

Abstract: Accurately predicting ambiguous 3D human poses from monocular video remains a significant research challenge. Existing deep learning methods can predict 3D joint coordinates, but most do not adequately account for the multi-solution nature of this inverse problem: a single 2D observation can correspond to several plausible 3D poses. Some multi-hypothesis prediction methods do address this ambiguity, yet they suffer from insufficient cross-level feature fusion. To address these issues, a 3D human pose estimation model based on a Temporal-Spatial Feature Pyramid Network and a multi-hypothesis interaction mechanism was proposed. First, a Transformer encoder used multi-head self-attention to capture the multi-modal distribution of possible poses and generate multiple initial hypothesis features. Second, a Temporal-Spatial Feature Pyramid Network (TSP-FPN) was designed, in which a gated adaptive fusion strategy dynamically weights and integrates multi-level skeleton sequence features, balancing local detail against global temporal information. Third, building on existing algorithms, a multi-hypothesis optimization module was implemented that combines joint relative position encoding with a cross-attention mechanism; it promotes communication and feature aggregation across hypotheses and strengthens the model's long-range reasoning over human body topology, yielding high-precision 3D joint coordinate predictions. Experimental results on the Human3.6M dataset show that the proposed model achieved a Mean Per-Joint Position Error (MPJPE) of 42.3 mm, a 1.6% reduction in prediction error relative to the state-of-the-art Multi-Hypothesis Transformer (MHFormer), indicating substantive progress in addressing the ambiguity of monocular 3D pose estimation.
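
The gated adaptive fusion strategy mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the gate's form, so everything here is an assumption made for illustration: a per-channel sigmoid gate computed from the concatenated inputs, and the hypothetical names `gated_fuse`, `gate_weights`, and `gate_bias`. It is not the authors' TSP-FPN implementation.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(local_feat, global_feat, gate_weights, gate_bias):
    """Blend two equal-length feature vectors with a per-channel gate.

    Assumed form (not taken from the paper):
        g_c   = sigmoid(w_c . [local; global] + b_c)
        out_c = g_c * local_c + (1 - g_c) * global_c
    """
    if len(local_feat) != len(global_feat):
        raise ValueError("feature vectors must have the same length")
    concat = list(local_feat) + list(global_feat)
    fused = []
    for c in range(len(local_feat)):
        # Gate logit for channel c, computed from both inputs.
        logit = sum(w * x for w, x in zip(gate_weights[c], concat)) + gate_bias[c]
        g = sigmoid(logit)
        fused.append(g * local_feat[c] + (1.0 - g) * global_feat[c])
    return fused


# Toy inputs: a fine-scale (local) and a coarse-scale (global) feature vector.
local_feat = [1.0, 2.0]
global_feat = [3.0, 6.0]
zero_weights = [[0.0] * 4, [0.0] * 4]

# With zero weights and bias the gate is 0.5, so the result is a plain average.
fused_even = gated_fuse(local_feat, global_feat, zero_weights, [0.0, 0.0])

# A large positive bias drives the gate toward 1, favoring the local features.
fused_local = gated_fuse(local_feat, global_feat, zero_weights, [50.0, 50.0])
```

In a trained network the gate parameters would be learned, so the weighting between local detail and global temporal context could differ for each channel and each input.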

Key words: 3D human pose estimation, dynamic fusion, multi-hypothesis interaction mechanism, attention mechanism, relative positional encoding
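
For readers unfamiliar with the metric reported in the abstract, MPJPE is the Euclidean distance between predicted and ground-truth joint positions, averaged over joints. The sketch below computes it on made-up toy coordinates and then works backward from the reported 42.3 mm and 1.6% figures to the baseline error they imply. The function names and the toy poses are illustrative assumptions, and the implied baseline is only approximate because the 1.6% figure is rounded.

```python
import math


def mpjpe(pred, truth):
    """Mean Euclidean distance between matching joints (same unit as the inputs)."""
    if not pred or len(pred) != len(truth):
        raise ValueError("poses must be non-empty and contain the same number of joints")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return (baseline - proposed) / baseline * 100.0


def implied_baseline(proposed, reduction_percent):
    """Baseline error implied by a proposed error and a relative reduction."""
    return proposed / (1.0 - reduction_percent / 100.0)


# Toy three-joint pose in millimetres (made-up numbers, not from Human3.6M).
truth = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0), (0.0, 200.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (0.0, 100.0, 0.0), (0.0, 200.0, 12.0)]
toy_error = mpjpe(pred, truth)  # per-joint errors are 5, 0 and 12 mm

# Figures quoted in the abstract: 42.3 mm MPJPE, 1.6% lower than the baseline.
baseline_mm = implied_baseline(42.3, 1.6)
check_percent = relative_reduction(baseline_mm, 42.3)
```

Under these figures the implied baseline is roughly 43 mm, so the reported improvement corresponds to about 0.7 mm of average joint error.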
