Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (10): 3236-3243. DOI: 10.11772/j.issn.1001-9081.2022101473

• Multimedia Computing and Computer Simulation •


Human action recognition method based on multi-scale feature fusion of single mode

Suolan LIU1,2, Zhenzhen TIAN1, Hongyuan WANG1(), Long LIN1, Yan WANG1   

  1. School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou University, Changzhou, Jiangsu 213164, China
    2. Jiangsu Key Laboratory of Image and Video Understanding for Social Security (Nanjing University of Science and Technology), Nanjing, Jiangsu 210094, China
  • Received:2022-10-11 Revised:2022-12-29 Accepted:2023-01-03 Online:2023-04-12 Published:2023-10-10
  • Contact: Hongyuan WANG
  • About author:LIU Suolan, born in 1980, Ph. D., associate professor. Her research interests include computer vision, artificial intelligence.
    TIAN Zhenzhen, born in 1997, M. S. candidate. Her research interests include computer vision, pattern recognition.
    LIN Long, born in 1998, M. S. candidate. His research interests include computer vision, data augmentation.
    WANG Yan, born in 1999, M. S. candidate. His research interests include computer vision, pattern recognition.
  • Supported by:
    National Natural Science Foundation of China(61976028);Open Project of Jiangsu Key Laboratory of Image and Video Understanding for Social Security(J2021-2)


Abstract:

To address the insufficient modeling of latent dependencies between distant joints in human action recognition, as well as the high training cost incurred by multi-modal data, a multi-scale feature fusion method for human action recognition under a single modality was proposed. Firstly, global feature correlation was applied to the original human skeleton graph, and coarse-scale global features were used to capture dependencies between distant joints. Secondly, the globally correlated graph was partitioned locally to obtain Complementary Subgraphs with Global Features (CSGFs); fine-scale features were used to establish strong local correlations, forming complementarity across scales. Finally, the CSGFs were fed into a spatial-temporal graph convolutional module for feature extraction, and the extracted results were aggregated to produce the final classification. Experimental results show that the proposed method achieves accuracies of 89.0% (X-sub) and 94.2% (X-view) on the widely used action recognition dataset NTU RGB+D60, and 83.3% (X-sub) and 85.0% (X-setup) on the challenging large-scale dataset NTU RGB+D120, exceeding the single-modality ST-TR (Spatial-Temporal TRansformer) by 1.4 and 0.9 percentage points, and the lightweight SGN (Semantics-Guided Network) by 4.1 and 3.5 percentage points, respectively. These results indicate that the proposed method fully exploits the synergistic complementarity of multi-scale features, and effectively improves the recognition accuracy and training efficiency of the model under a single modality.
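The coarse/fine split described above can be illustrated with a minimal sketch: one spatial graph-convolution step applied to a skeleton under two adjacencies, a fine-scale graph of physical bone connections and a coarse-scale fully connected graph that lets distant joints exchange information in a single hop. The 5-joint toy skeleton, feature sizes, and the fully connected "global" adjacency are illustrative assumptions for this sketch, not the paper's actual architecture or its CSGF construction.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalize A + I, as in standard GCN layers."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def graph_conv(X, A_norm, W):
    """One spatial graph convolution: aggregate neighbor features, then project."""
    return np.maximum(A_norm @ X @ W, 0.0)  # ReLU activation

# Toy skeleton: 5 joints, physically connected in a tree (hypothetical layout).
V, C_in, C_out = 5, 3, 8
A_local = np.zeros((V, V))
for i, j in [(0, 1), (1, 2), (2, 3), (2, 4)]:  # fine-scale bone connections
    A_local[i, j] = A_local[j, i] = 1.0

# Coarse-scale "global" graph: every joint connected to every other joint,
# so distant joints (e.g. head and hip) interact in one convolution step.
A_global = np.ones((V, V)) - np.eye(V)

rng = np.random.default_rng(0)
X = rng.standard_normal((V, C_in))      # per-joint input features (e.g. 3D coordinates)
W = rng.standard_normal((C_in, C_out))  # learnable projection, shared across scales here

out_local = graph_conv(X, normalize_adjacency(A_local), W)
out_global = graph_conv(X, normalize_adjacency(A_global), W)

# Fusing both scales gives each joint fine local structure plus long-range context.
fused = np.concatenate([out_local, out_global], axis=1)
print(fused.shape)  # (5, 16)
```

In this toy form, the fine-scale branch propagates features only along physical bones, while the coarse-scale branch supplies the long-range joint dependencies that the abstract argues are under-exploited; the real method replaces the naive fully connected graph with globally correlated complementary subgraphs.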

Key words: human action recognition, skeleton joint, Graph Convolutional Network (GCN), single modality, multi-scale, feature fusion
