Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (4): 1293-1299. DOI: 10.11772/j.issn.1001-9081.2024040507

• Multimedia computing and computer simulation •

3D hand pose estimation combining attention mechanism and multi-scale feature fusion

Shiyue GUO1, Jianwu DANG1,2, Yangping WANG1,2, Jiu YONG1,2

  1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou, Gansu 730070, China
    2. Gansu Artificial Intelligence and Graphics and Image Processing Engineering Research Center (Lanzhou Jiaotong University), Lanzhou, Gansu 730070, China
  • Received: 2024-04-25  Revised: 2024-07-17  Accepted: 2024-07-18  Online: 2025-04-08  Published: 2025-04-10
  • Contact: Shiyue GUO
  • About author: DANG Jianwu, born in 1963, Ph. D., professor. His research interests include intelligent information processing and artificial intelligence.
    WANG Yangping, born in 1973, Ph. D., professor. Her research interests include digital image processing and virtual reality.
    YONG Jiu, born in 1993, Ph. D. candidate, engineer. His research interests include digital image processing and virtual reality.
  • Supported by:
    National Natural Science Foundation of China(62067006);Gansu Province Intellectual Property Program(21ZSCQ013);Major Cultivation Project of Scientific Research Innovation Platform in Colleges and Universities in Gansu Province(2024CXPT-17);Humanities and Social Sciences Research Project of Ministry of Education(21YJC880085);Gansu Provincial Natural Science Foundation(23JRRA845);Lanzhou Youth Science and Technology Talent Innovation Project(2023-QN-117)


Abstract:

To address the problem of inaccurate 3D hand pose estimation from a single RGB image caused by occlusion and self-similarity, a 3D hand pose estimation network combining attention mechanism and multi-scale feature fusion was proposed. Firstly, a Sensory Enhancement Module (SEM) combining dilated convolution and the CBAM (Convolutional Block Attention Module) attention mechanism was proposed to replace the BasicBlock of the HourGlass Network (HGNet), expanding the receptive field and enhancing sensitivity to spatial information, so as to improve the ability to extract hand features. Secondly, a multi-scale information fusion module, SS-MIFM (SPCNet and Soft-attention-Multi-scale Information Fusion Module), combining SPCNet (Spatial Preserve and Content-aware Network) and Soft-Attention enhancement was designed to aggregate multi-level features effectively and improve the accuracy of 2D hand keypoint detection significantly while fully considering the spatial content awareness mechanism. Finally, a 2.5D pose conversion module was proposed to convert the 2D pose into a 3D pose, thereby avoiding the spatial information loss caused by directly regressing 3D pose information from 2D keypoint coordinates. Experimental results show that on the InterHand2.6M dataset, the two-hand Mean Per Joint Position Error (MPJPE), the single-hand MPJPE, and the Mean Relative-Root Position Error (MRRPE) of the proposed algorithm reach 12.32, 9.96 and 29.57 mm, respectively; on RHD (Rendered Hand pose Dataset), compared with the InterNet and QMCG-Net algorithms, the proposed algorithm reduces the End-Point Error (EPE) by 2.68 and 0.38 mm, respectively. The above results demonstrate that the proposed algorithm can estimate hand pose more accurately and is more robust in some two-hand interaction and occlusion scenarios.
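
To make the two most algorithmic steps in the abstract concrete, the following PyTorch-style sketch shows one plausible form of the SEM block (a dilated residual convolution followed by CBAM channel and spatial attention, used in place of an hourglass BasicBlock) and a generic 2.5D-to-3D lifting function. All layer widths, dilation rates, and the pinhole back-projection formula are illustrative assumptions for explanation only, not the authors' implementation.

    # Hedged sketch of an SEM-style block: dilated convolution + CBAM attention,
    # shaped as a drop-in replacement for a residual BasicBlock. Hyperparameters
    # (reduction ratio, dilation, kernel size) are assumptions, not from the paper.
    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """CBAM channel branch: shared MLP over average- and max-pooled features."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Conv2d(channels, channels // reduction, 1, bias=False),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1, bias=False),
            )

        def forward(self, x):
            avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
            mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
            return torch.sigmoid(avg + mx)

    class SpatialAttention(nn.Module):
        """CBAM spatial branch: 7x7 convolution over pooled channel statistics."""
        def __init__(self, kernel_size=7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

        def forward(self, x):
            avg = torch.mean(x, dim=1, keepdim=True)
            mx, _ = torch.max(x, dim=1, keepdim=True)
            return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

    class SEM(nn.Module):
        """Dilated residual block followed by CBAM; replaces an hourglass BasicBlock."""
        def __init__(self, channels, dilation=2):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=dilation,
                                   dilation=dilation, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)
            self.ca = ChannelAttention(channels)
            self.sa = SpatialAttention()

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))   # dilated conv enlarges receptive field
            out = self.bn2(self.conv2(out))
            out = out * self.ca(out)                   # reweight channels
            out = out * self.sa(out)                   # reweight spatial positions
            return self.relu(out + x)                  # residual connection

    # Hedged sketch of a generic 2.5D-to-3D lifting step (assumed pinhole model;
    # the paper's 2.5D pose conversion module may differ): 2D pixel keypoints plus
    # per-joint root-relative depths are back-projected with camera intrinsics.
    def lift_25d_to_3d(uv, z_rel, z_root, fx, fy, cx, cy):
        """uv: (J, 2) pixel coordinates; z_rel: (J,) depth relative to the root joint;
        z_root: absolute root depth; returns (J, 3) camera-space coordinates."""
        z = z_rel + z_root                 # absolute depth per joint
        x = (uv[:, 0] - cx) * z / fx       # pinhole back-projection
        y = (uv[:, 1] - cy) * z / fy
        return torch.stack([x, y, z], dim=1)

In this sketch the CBAM gates act after the dilated residual path, so channel and spatial weighting operate on features with an enlarged receptive field, which matches the intuition the abstract describes; the lifting step simply adds the estimated root depth to each joint's relative depth before back-projecting with the camera intrinsics.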

Key words: hand pose estimation, multi-scale feature fusion, attention mechanism, High-Resolution Net (HRNet), HourGlass Network (HGNet)


CLC Number: