3D human pose estimation model based on temporal-spatial feature pyramid network and multi-hypothesis interaction mechanism

Jinxiao ZHANG, Chenglong LI, Xinyan GAO, Ming ZHANG

Journal of Computer Applications 2026, 46 (6): 1965-1972. DOI: 10.11772/j.issn.1001-9081.2025060763

Accurately estimating ambiguous Three-Dimensional (3D) human poses from monocular videos remains a research challenge. Although existing methods can estimate 3D joint coordinates with deep learning models, most of them do not adequately account for the multi-solution nature of this inverse problem. Some multi-hypothesis estimation methods do address the multi-solution problem, but they suffer from insufficient cross-level feature fusion. To address these issues, a 3D human pose estimation model based on a Temporal-SPatial Feature Pyramid Network (TSP-FPN) and a multi-hypothesis interaction mechanism, called TSP-FPN-MHFormer (Temporal-SPatial Feature Pyramid Network-Multi-Hypothesis Transformer), was proposed. Firstly, based on a Transformer encoder, the multi-possibility distribution of human poses was captured with a multi-head self-attention mechanism, generating multiple initial hypothesis features. Then, a TSP-FPN was designed, and a gated adaptive fusion strategy was employed to perform dynamically weighted integration of multi-level skeleton sequence features, balancing local details against global temporal information. Finally, based on the Multi-Hypothesis Transformer (MHFormer), a multi-hypothesis optimization module combining joint Relative Position Bias (RPB) with a cross-attention mechanism was implemented. This module supports cross-hypothesis communication and feature aggregation, strengthening the model's long-range reasoning over human topology for high-precision 3D joint coordinate estimation. Experimental results on the Human3.6M dataset show that the proposed model achieves a Mean Per Joint Position Error (MPJPE) of 42.3 mm, reducing the estimation error by 1.6% compared with the state-of-the-art method MHFormer, which indicates that the proposed model makes progress on the multi-solution challenge of monocular 3D pose estimation.
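The MPJPE metric quoted in the abstract is the average Euclidean distance between predicted and ground-truth joint positions. The following is a minimal NumPy sketch of that standard definition, not the authors' evaluation code. The function names `mpjpe` and `relative_reduction` are hypothetical, the array layout (frames, joints, 3) is an assumption, and the abstract does not say whether any alignment step precedes the computation. The 43.0 mm baseline in the usage note is likewise an assumption, chosen only because it is consistent with the reported 1.6% reduction.

```python
import numpy as np


def mpjpe(pred, target):
    """Mean per-joint position error.

    Assumes both arrays have shape (..., joints, 3) and share the same unit
    (for example millimetres). Returns the mean Euclidean distance taken
    over every joint in every frame.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    # Per-joint Euclidean distance along the coordinate axis, then average.
    per_joint_error = np.linalg.norm(pred - target, axis=-1)
    return float(per_joint_error.mean())


def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` lowers the error relative to `baseline`."""
    return (baseline - proposed) / baseline * 100.0
```

As a usage example, `relative_reduction(43.0, 42.3)` evaluates to roughly 1.6 under the assumed baseline, which matches the magnitude of improvement stated in the abstract.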
