Journal of Computer Applications


Action recognition based on fusion of temporal convolution and multi-dimensional attention

LI Yuchen, LI Wanggen, WANG Cheng, GAO Shangshu, ZHANG Chunsheng   

  1. School of Computer & Information, Anhui Normal University
  • Received: 2025-09-24 Revised: 2025-12-16 Online: 2025-12-30 Published: 2025-12-30
  • About author: LI Yuchen, born in 1999, M.S. candidate. His research interests include action recognition. LI Wanggen, born in 1973, Ph.D., professor. His research interests include biological computing and computational intelligence. WANG Cheng, born in 1989, Ph.D. His research interests include intelligent analysis of medical images. GAO Shangshu, born in 2000, M.S. candidate. His research interests include depression screening. ZHANG Chunsheng, born in 2001, M.S. candidate. His research interests include human pose estimation.
  • Supported by:
    National Natural Science Foundation of China (61976006)

Action recognition based on fusion of temporal convolution and multi-dimensional attention

LI Yuchen, LI Wanggen, WANG Cheng, GAO Shangshu, ZHANG Chunsheng   

  1. School of Computer & Information, Anhui Normal University
  • Corresponding author: LI Wanggen
  • About author: LI Yuchen (born 1999 in Suzhou, Anhui), male, M.S. candidate; his research interests include action recognition. LI Wanggen (born 1973 in Taihu, Anhui), male, Ph.D., professor; his research interests include biological computing and computational intelligence. WANG Cheng (born 1989 in Lu'an, Anhui), male, Ph.D.; his research interests include intelligent analysis of medical images. GAO Shangshu (born 2000 in Suzhou, Anhui), male, M.S. candidate; his research interests include depression screening. ZHANG Chunsheng (born 2001 in Chizhou, Anhui), male, M.S. candidate; his research interests include human pose estimation.
  • Supported by:
    National Natural Science Foundation of China (61976006)

Abstract: To address the issues of insufficient temporal representation and inadequate feature aggregation in most existing skeleton-based action recognition algorithms, a skeleton-based action recognition method using dual-branch temporal convolution and multi-dimensional attention was developed. Firstly, progressive cross-scale temporal convolution (PCST-Conv) was employed to extract local temporal dependencies of actions, which reduced model parameters while improving recognition performance. Secondly, a dual-branch parallel temporal block (DBPT-Block) was utilized to capture both long- and short-term temporal dependencies effectively through a parallel temporal convolution architecture. Meanwhile, a cross-temporal adaptive fusion (CTAF) module was introduced to enhance the representation of key action frames through dynamic weight allocation, addressing the problem that traditional multi-scale fusion tends to ignore temporal variations across different action moments. Finally, hybrid multi-dimensional attention (HMDA) was applied to efficiently aggregate multi-dimensional features and further optimize feature representation. Experimental results showed that the method achieved accuracies of 91.7% on the cross-subject (CS) benchmark and 96.0% on the cross-view (CV) benchmark of the NTU-RGB+D 60 dataset. On the NTU-RGB+D 120 dataset, 87.4% accuracy was achieved on the cross-subject (CS-120) benchmark and 88.3% on the cross-setup (SS-120) benchmark. Moreover, compared with existing mainstream methods, higher recognition accuracy was achieved with fewer parameters, demonstrating significant advantages in both accuracy and efficiency. 
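The core idea the abstract describes — two parallel temporal-convolution branches whose outputs are fused by dynamically allocated per-frame weights — can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, kernel sizes, and dilation choices below are assumptions, and the sketch stands in for one channel-and-joint slice of a skeleton sequence rather than the full (batch, channel, frame, joint) tensor the method operates on.

```python
import numpy as np

# Hypothetical sketch of a dual-branch temporal block: one branch with a small
# temporal kernel for short-range dependencies, one dilated branch for longer
# range, fused by per-frame softmax weights (the "dynamic weight allocation"
# described in the abstract). All names and hyperparameters are illustrative.

def temporal_conv(x, kernel, dilation=1):
    """Depthwise 1-D convolution along the time axis.
    x: (C, T) features for one joint; kernel: (K,) weights shared across channels."""
    K = len(kernel)
    pad = dilation * (K - 1) // 2          # "same" padding for odd K
    xp = np.pad(x, ((0, 0), (pad, pad)))
    T = x.shape[1]
    out = np.zeros_like(x)
    for k in range(K):
        out += kernel[k] * xp[:, k * dilation : k * dilation + T]
    return out

def dual_branch_fuse(x, w_short, w_long, gate):
    """Run both temporal branches and fuse with per-frame softmax weights.
    gate: (2, T) unnormalized scores, softmaxed over the branch axis so the
    two branch contributions sum to 1 at every frame."""
    short = temporal_conv(x, w_short, dilation=1)   # local dependencies
    long_ = temporal_conv(x, w_long, dilation=2)    # dilated, longer range
    a = np.exp(gate - gate.max(axis=0, keepdims=True))
    a = a / a.sum(axis=0, keepdims=True)            # (2, T) fusion weights
    return a[0] * short + a[1] * long_

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))                    # C=8 channels, T=16 frames
y = dual_branch_fuse(x, rng.standard_normal(3), rng.standard_normal(3),
                     rng.standard_normal((2, 16)))
print(y.shape)  # (8, 16): same temporal length as the input
```

Because the gate is softmaxed per frame, the block can emphasize the short-range branch at fast transitions and the dilated branch elsewhere, which is one plausible reading of how key action frames get strengthened relative to a fixed multi-scale average.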

Key words: action recognition, skeleton sequence, adaptive fusion, multi-dimensional attention, dual-branch temporal convolution

Abstract: Aiming at the problems of insufficient temporal representation capability and inadequate feature aggregation in most existing skeleton-based human action recognition algorithms, a skeleton-based action recognition method using dual-branch temporal convolution and multi-dimensional attention was proposed. Firstly, progressive cross-scale temporal convolution (PCST-Conv) was used to extract the local temporal dependencies of actions, improving recognition performance while reducing the number of model parameters. Secondly, a dual-branch parallel temporal block (DBPT-Block) was employed to effectively capture the temporal dependencies of both long- and short-range actions. Meanwhile, a cross-temporal adaptive fusion (CTAF) module was introduced to strengthen the representation of key action frames through dynamic weight allocation, solving the problem that traditional multi-scale fusion ignores the differences between different moments of an action. Finally, hybrid multi-dimensional attention (HMDA) was used to efficiently aggregate multi-dimensional features and further optimize feature representation. Experimental results show that the method achieves 91.7% on the cross-subject (CS) benchmark and 96.0% on the cross-view (CV) benchmark of the NTU-RGB+D 60 dataset, and recognition accuracies of 87.4% and 88.3% on the cross-subject (CS-120) and cross-setup (SS-120) benchmarks of the NTU-RGB+D 120 dataset, respectively. Moreover, compared with existing mainstream methods, the method achieves higher recognition accuracy with fewer parameters, showing significant advantages in both accuracy and efficiency.

Key words: action recognition, skeleton sequence, adaptive fusion, multi-dimensional attention, dual-branch temporal convolution

CLC Number: