3D human pose estimation model based on temporal-spatial feature pyramid network and multi-hypothesis interaction mechanism
Jinxiao ZHANG, Chenglong LI, Xinyan GAO, Ming ZHANG
Journal of Computer Applications, 2026, 46(6): 1965-1972. DOI: 10.11772/j.issn.1001-9081.2025060763

Accurately estimating ambiguous Three-Dimensional (3D) human poses from monocular videos remains a research challenge. Although existing methods can estimate 3D joint coordinates using deep learning models, most of them do not adequately consider the multi-solution nature of this inverse problem. Some multi-hypothesis estimation methods address the multi-solution problem, but they suffer from insufficient cross-level feature fusion. To address these issues, a 3D human pose estimation model based on a Temporal-SPatial Feature Pyramid Network (TSP-FPN) and a multi-hypothesis interaction mechanism, called TSP-FPN-MHFormer (Temporal-SPatial Feature Pyramid Network-Multi-Hypothesis Transformer), was proposed. Firstly, based on a Transformer encoder, the multi-possibility distribution of human poses was captured by using a multi-head self-attention mechanism, thereby generating multiple initial hypothesis features. Then, a TSP-FPN was designed, and a gated adaptive fusion strategy was employed to achieve dynamic weighted integration of multi-level skeleton sequence features, thereby effectively balancing local details and global temporal information. Finally, based on the Multi-Hypothesis Transformer (MHFormer), a multi-hypothesis optimization module combining joint Relative Position Bias (RPB) with a cross-attention mechanism was implemented. This module facilitates cross-hypothesis communication and feature aggregation, and enhances the model's long-range reasoning over human topology for high-precision 3D joint coordinate estimation. Experimental results on the Human3.6M dataset demonstrate that the proposed model achieves a Mean Per Joint Position Error (MPJPE) of 42.3 mm, reducing the estimation error by 1.6% compared with the state-of-the-art method MHFormer. This indicates substantial progress by the proposed model in addressing the multi-solution challenge of monocular 3D pose estimation.
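
For readers unfamiliar with the metric reported above, MPJPE is commonly computed as the mean Euclidean distance between predicted and ground-truth joint positions. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function names `mpjpe` and `relative_reduction` and the toy coordinates are assumptions made for illustration, and the root-joint alignment often applied before the computation is omitted.

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance between corresponding joints.

    pred and target are equally long sequences of (x, y, z) joint
    coordinates in the same unit (for example, millimetres).
    """
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def relative_reduction(baseline, improved):
    """Relative error reduction of `improved` over `baseline`, in percent."""
    return (baseline - improved) / baseline * 100.0


# Toy three-joint pose (hypothetical values, not taken from Human3.6M).
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0)]
target = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 3.0)]

# Per-joint distances are 0, 5 and 7, so the mean is 4.0.
print(mpjpe(pred, target))
```

The abstract does not state the baseline error in millimetres, so `relative_reduction` is shown only to make explicit how a percentage such as the reported 1.6% would be derived from two MPJPE values, assuming the percentage is a relative reduction.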
