Learnable query-based efficient multi-object gesture recognition

doi:10.11772/j.issn.1001-9081.2024111577

Journal of Computer Applications

Received: 2024-11-05 Revised: 2024-12-09 Online: 2024-12-24 Published: 2024-12-24

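马骋昊¹, 林垠²,³, 陈叶瀚森³, 殷保才³, 高建清³

1. China Electronics Standardization Institute
2. University of Science and Technology of China
3. iFLYTEK Co., Ltd.

Corresponding author: 林垠
Funding:
Highly immersive natural human-computer interaction system based on audio-visual-tactile multi-channel fusion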

Abstract

With the advancement of intelligent human-computer interaction technology, gesture recognition based on 2D images has been widely applied in various fields. Existing methods decompose gesture recognition into two independent stages, hand tracking and gesture classification: the region and motion trajectory of each hand are determined first, and the corresponding image region is then cropped for classification. However, these methods rely heavily on the performance of the front-end model (such as hand detection), and in multi-person scenarios their computational cost increases linearly with the number of hands to be recognized, so they fail to balance efficiency and effectiveness. To address these issues, an Efficient Query-Based Multi-Object Gesture Recognition (EQMGR) algorithm is proposed. The method performs end-to-end multi-object gesture recognition by combining multiple learnable query vectors with an attention mechanism. Each query adaptively focuses on one specific person in the whole image, so the gestures of all objects in the image are recognized in a single inference pass. Furthermore, by propagating queries between frames, the query vectors model the temporal features of each object at no additional computational cost, which yields high-performance recognition of both dynamic and static gestures. To validate the method on multi-object dynamic and static gesture recognition tasks, a multi-object gesture recognition dataset was collected and annotated. Experimental results on this dataset show that the proposed EQMGR algorithm achieves a precision of 93.2% and a recall of 96.1% while reaching an inference speed of 25.2 frames per second (FPS) on a single GPU, demonstrating efficient and accurate gesture recognition.

Key words: multi-object gesture recognition, efficient dynamic and static gesture recognition, learnable query, frame-wise query propagation
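
The mechanism summarized in the abstract (a fixed set of queries attending to the whole image, with the queries carried from frame to frame) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, dimensions, random inputs, and the simplified single-head attention without learned projections are all assumptions made for illustration. In a trained model the initial queries would be learned parameters and each output query would feed a gesture classification head.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """One cross-attention step: every query reads from all image features.

    Hypothetical simplification: single head, no learned projections,
    residual update only.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        # Scaled dot-product score between this query and each feature token.
        scores = [scale * sum(a * b for a, b in zip(q, f)) for f in features]
        weights = softmax(scores)
        # Attention-weighted sum of the feature tokens.
        attended = [
            sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)
        ]
        updated.append([qi + ai for qi, ai in zip(q, attended)])
    return updated


def run_clip(init_queries, frames):
    """Process frames in order, carrying the queries from frame to frame."""
    queries = init_queries
    per_frame = []
    for features in frames:
        # One pass updates every query, whatever the number of people.
        queries = cross_attend(queries, features)
        per_frame.append(queries)
    return per_frame


# Illustrative sizes and random stand-ins for queries and backbone features.
rng = random.Random(0)
NUM_QUERIES, DIM, NUM_TOKENS, NUM_FRAMES = 4, 8, 16, 3
init_queries = [
    [rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)
]
frames = [
    [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]
    for _ in range(NUM_FRAMES)
]
outputs = run_clip(init_queries, frames)
```

In this sketch the per-frame cost depends on the fixed number of queries rather than on how many hands appear, and the temporal context comes only from reusing the previous frame's queries as the next frame's input, with no separate temporal module.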

CLC Number: TP391.4

马骋昊, 林垠, 陈叶瀚森, 殷保才, 高建清. Learnable query-based efficient multi-object gesture recognition [J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2024111577.