Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (4): 1293-1299. DOI: 10.11772/j.issn.1001-9081.2024040507

• Multimedia computing and computer simulation •

3D hand pose estimation combining attention mechanism and multi-scale feature fusion

Shiyue GUO1, Jianwu DANG1,2, Yangping WANG1,2, Jiu YONG1,2

  1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou, Gansu 730070, China
    2. Gansu Artificial Intelligence and Graphics and Image Processing Engineering Research Center (Lanzhou Jiaotong University), Lanzhou, Gansu 730070, China
  • Received: 2024-04-25  Revised: 2024-07-17  Accepted: 2024-07-18  Online: 2025-04-08  Published: 2025-04-10
  • Contact: Shiyue GUO
  • About author: DANG Jianwu, born in 1963, Ph. D., professor. His research interests include intelligent information processing and artificial intelligence.
    WANG Yangping, born in 1973, Ph. D., professor. Her research interests include digital image processing and virtual reality.
    YONG Jiu, born in 1993, Ph. D. candidate, engineer. His research interests include digital image processing and virtual reality.
  • Supported by:
    National Natural Science Foundation of China(62067006);Gansu Province Intellectual Property Program(21ZSCQ013);Major Cultivation Project of Scientific Research Innovation Platform in Colleges and Universities in Gansu Province(2024CXPT-17);Humanities and Social Sciences Research Project of Ministry of Education(21YJC880085);Gansu Provincial Natural Science Foundation(23JRRA845);Lanzhou Youth Science and Technology Talent Innovation Project(2023-QN-117)


Abstract:

To address the problem of inaccurate 3D hand pose estimation from a single RGB image caused by occlusion and self-similarity, a 3D hand pose estimation network combining attention mechanism and multi-scale feature fusion was proposed. Firstly, a Sensory Enhancement Module (SEM) combining dilated convolution and the CBAM (Convolutional Block Attention Module) attention mechanism was proposed to replace the BasicBlock of the HourGlass Network (HGNet), expanding the receptive field and enhancing sensitivity to spatial information, so as to improve the ability to extract hand features. Secondly, a multi-scale information fusion module, SS-MIFM (SPCNet and Soft-attention-Multi-scale Information Fusion Module), combining SPCNet (Spatial Preserve and Content-aware Network) and Soft-Attention enhancement was designed to aggregate multi-level features effectively and improve the accuracy of 2D hand keypoint detection significantly while fully considering the spatial content awareness mechanism. Finally, a 2.5D pose conversion module was proposed to convert the 2D pose into a 3D pose, thereby avoiding the spatial information loss caused by directly regressing 3D pose information from 2D keypoint coordinates. Experimental results show that on the InterHand2.6M dataset, the two-hand Mean Per Joint Position Error (MPJPE), the single-hand MPJPE, and the Mean Relative-Root Position Error (MRRPE) of the proposed algorithm reach 12.32, 9.96 and 29.57 mm, respectively; on RHD (Rendered Hand pose Dataset), compared with the InterNet and QMCG-Net algorithms, the proposed algorithm reduces the End-Point Error (EPE) by 2.68 and 0.38 mm, respectively. The above results demonstrate that the proposed algorithm can estimate hand pose more accurately and is more robust in some two-hand interaction and occlusion scenarios.
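
To make the two most algorithmic steps in the abstract concrete, the following PyTorch-style sketch shows one plausible form of the SEM block (a dilated residual convolution followed by CBAM channel and spatial attention, used in place of an hourglass BasicBlock) and a generic 2.5D-to-3D lifting function. All layer widths, dilation rates, and the pinhole back-projection formula are illustrative assumptions for explanation only, not the authors' implementation.

    # Hedged sketch of an SEM-style block: dilated convolution + CBAM attention,
    # shaped as a drop-in replacement for a residual BasicBlock. Hyperparameters
    # (reduction ratio, dilation, kernel size) are assumptions, not from the paper.
    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """CBAM channel branch: shared MLP over average- and max-pooled features."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Conv2d(channels, channels // reduction, 1, bias=False),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1, bias=False),
            )

        def forward(self, x):
            avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
            mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
            return torch.sigmoid(avg + mx)

    class SpatialAttention(nn.Module):
        """CBAM spatial branch: 7x7 convolution over pooled channel statistics."""
        def __init__(self, kernel_size=7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

        def forward(self, x):
            avg = torch.mean(x, dim=1, keepdim=True)
            mx, _ = torch.max(x, dim=1, keepdim=True)
            return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

    class SEM(nn.Module):
        """Dilated residual block followed by CBAM; replaces an hourglass BasicBlock."""
        def __init__(self, channels, dilation=2):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=dilation,
                                   dilation=dilation, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)
            self.ca = ChannelAttention(channels)
            self.sa = SpatialAttention()

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))   # dilated conv enlarges receptive field
            out = self.bn2(self.conv2(out))
            out = out * self.ca(out)                   # reweight channels
            out = out * self.sa(out)                   # reweight spatial positions
            return self.relu(out + x)                  # residual connection

    # Hedged sketch of a generic 2.5D-to-3D lifting step (assumed pinhole model;
    # the paper's 2.5D pose conversion module may differ): 2D pixel keypoints plus
    # per-joint root-relative depths are back-projected with camera intrinsics.
    def lift_25d_to_3d(uv, z_rel, z_root, fx, fy, cx, cy):
        """uv: (J, 2) pixel coordinates; z_rel: (J,) depth relative to the root joint;
        z_root: absolute root depth; returns (J, 3) camera-space coordinates."""
        z = z_rel + z_root                 # absolute depth per joint
        x = (uv[:, 0] - cx) * z / fx       # pinhole back-projection
        y = (uv[:, 1] - cy) * z / fy
        return torch.stack([x, y, z], dim=1)

In this sketch the CBAM gates act after the dilated residual path, so channel and spatial weighting operate on features with an enlarged receptive field, which matches the intuition the abstract describes; the lifting step simply adds the estimated root depth to each joint's relative depth before back-projecting with the camera intrinsics.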

Key words: hand pose estimation, multi-scale feature fusion, attention mechanism, High-Resolution Net (HRNet), HourGlass Network (HGNet)


CLC Number: