Action recognition algorithm based on attention mechanism and energy function

doi:10.11772/j.issn.1001-9081.2024010004

Journal of Computer Applications, 2025, Vol. 45, Issue (1): 234-239.

• Multimedia computing and computer simulation •

Lifang WANG¹ (corresponding author), Jingshuang WU¹, Pengliang YIN², Lihua HU¹

^1. College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan Shanxi 030024, China
^2. Shanghai Freedom-Chips Microelectronics Company Limited, Xi'an Shaanxi 710000, China

Received: 2024-01-10 Revised: 2024-03-15 Accepted: 2024-03-21 Online: 2024-05-09 Published: 2025-01-10
Contact: Lifang WANG
About author: WANG Lifang, born in 1975, native of Heshun, Shanxi, Ph. D., associate professor, CCF member. Her research interests include intelligent optimization, image processing. E-mail: wanglifang@tyust.edu.cn
WU Jingshuang, born in 1998, native of Yuncheng, Shanxi, M. S. candidate. Her research interests include computer vision, action recognition.
YIN Pengliang, born in 1987, native of Xinzhou, Shanxi, M. S. His research interests include computer vision, deep learning, image processing and pattern recognition, test and measurement methods and instruments.
HU Lihua, born in 1982, native of Xinzhou, Shanxi, Ph. D., professor, CCF member. Her research interests include computer vision.
Supported by:
National Natural Science Foundation of China (62273248); Natural Science Foundation of Shanxi Province (202203021211206); Graduate Education Reform Project of Shanxi Province (2021YJJG238); Graduate Research Innovation Project of Shanxi Province (2023KY657); Taiyuan University of Science and Technology PhD Start-up Fund (20212021)

Abstract:

To address the lack of structural guidance in the frameworks of Zero-Shot Action Recognition (ZSAR) algorithms, an Action Recognition Algorithm based on Attention mechanism and Energy function (ARAAE) was proposed, with the Energy-Based Model (EBM) guiding the framework design. Firstly, to obtain the input of the EBM, a combination of optical flow and the Convolutional 3D (C3D) architecture was designed to extract visual features and thereby remove spatial redundancy. Secondly, Vision Transformer (ViT) was used for visual feature extraction to reduce temporal redundancy, and ViT was combined with the optical flow plus C3D architecture to reduce spatial redundancy, yielding a non-redundant visual space. Finally, to measure the correlation between the visual space and the semantic space, an energy score evaluation mechanism was implemented, and a joint loss function was designed for optimization. ARAAE was compared with six classical ZSAR algorithms and algorithms from recent literature on two datasets, HMDB51 and UCF101. Compared with algorithms such as CAGE (Coupling Adversarial Graph Embedding), Bi-dir GAN (Bi-directional Generative Adversarial Network) and ETSAN (Energy-based Temporal Summarized Attentive Network), the average recognition accuracy of ARAAE reaches (22.1±1.8)% on the evenly split (average grouping) HMDB51 dataset, clearly better than all comparison algorithms; on the evenly split UCF101 dataset it reaches (22.4±1.6)%, slightly better than the comparison algorithms; and on UCF101 with the 81/20 split it reaches (40.2±2.6)%, higher than all comparison algorithms. These results indicate that ARAAE effectively improves recognition performance in ZSAR.

Key words: Zero-Shot Action Recognition (ZSAR), energy function, attention mechanism, optical flow, visual feature
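The abstract states that an energy score measures the correlation between the visual space and the semantic space, but this page does not give the energy function itself. The following is only a minimal sketch of the general energy-based idea, assuming a hypothetical energy defined as the negative cosine similarity between a video feature and a class embedding; the function names, the toy 3-D vectors and the class names are illustrative assumptions, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def energy(visual, semantic):
    """Hypothetical energy: lower when the two embeddings are better aligned."""
    return -cosine(visual, semantic)


def predict_unseen(visual, class_embeddings):
    """Return the unseen class whose semantic embedding has the lowest energy."""
    return min(class_embeddings,
               key=lambda name: energy(visual, class_embeddings[name]))


# Toy 3-D embeddings standing in for projected video features and word vectors.
unseen_classes = {
    "archery": [0.9, 0.1, 0.0],
    "biking": [0.0, 1.0, 0.2],
    "diving": [0.1, 0.0, 1.0],
}
video_feature = [0.2, 0.9, 0.1]
prediction = predict_unseen(video_feature, unseen_classes)
print(prediction)
```

In a zero-shot setting the candidate classes are the unseen ones, so picking the minimum-energy class needs no training samples of those classes, only their semantic embeddings.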

CLC Number: TP391.4

Lifang WANG, Jingshuang WU, Pengliang YIN, Lihua HU. Action recognition algorithm based on attention mechanism and energy function[J]. Journal of Computer Applications, 2025, 45(1): 234-239.

Figures/Tables: 7

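Software	Description
PyTorch 1.9	Deep learning computing platform
Python 3.6	Data preprocessing and building the algorithm framework
OpenCV	Video preprocessing
Gensim	Semantic encoder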

Algorithm	Visual feature	Semantic feature	HMDB51 (26/25)	UCF101 (51/50)	UCF101 (81/20)
SJE	L	WV	13.3±2.4	9.9±1.4	-
MTE	D	WV	19.7±1.6	-	15.8±1.3
ZSECOC	L	A	-	3.2±0.7	-
ZSECOC	L	WV	16.5±3.9	13.7±0.5	-
BiDiLEL	D	A	-	20.5±0.5	39.2±1.0
BiDiLEL	D	WV	18.6±0.7	18.9±0.4	38.3±1.2
GMM	D	WV	19.3±2.1	17.3±1.1	-
TARN	D	WV	19.5±4.2	19.0±2.3	36.0±5.3
CAGE	D	WV	20.8±2.9	12.9±1.8	-
Bi-dir GAN	D	WV	17.5±2.4	17.2±2.3	-
ETSAN	D	WV	-	20.6±1.6	39.4±2.1
ARAAE (ours)	D	A	-	13.4±1.8	24.9±2.6
ARAAE (ours)	D	WV	22.1±1.8	22.4±1.6	40.2±2.6
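The entries in the table above are reported in the form mean±deviation for each class split. As a hedged illustration of how such an entry is typically produced, the sketch below draws a random class split and summarizes per-split accuracies. It assumes that a label such as 26/25 denotes seen/unseen class counts (the usual ZSAR convention, not stated on this page), that the deviation is a sample standard deviation, and it uses made-up accuracies; the helper names are hypothetical.

```python
import random
import statistics


def random_split(classes, n_seen, seed):
    """Shuffle class names with a fixed seed and split them into seen/unseen."""
    rng = random.Random(seed)
    shuffled = list(classes)
    rng.shuffle(shuffled)
    return shuffled[:n_seen], shuffled[n_seen:]


def summarize(accuracies):
    """Format per-split accuracies (in %) as 'mean±std' with one decimal."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample standard deviation (assumption)
    return f"{mean:.1f}±{std:.1f}"


# 51 placeholder class names, mirroring the class count of HMDB51.
classes = [f"class_{i}" for i in range(51)]
seen, unseen = random_split(classes, n_seen=26, seed=0)

# Made-up accuracies for five hypothetical splits; not results from the paper.
toy_accuracies = [20.0, 22.0, 24.0, 21.0, 23.0]
summary = summarize(toy_accuracies)
print(len(seen), len(unseen), summary)
```

Fixing the seed per split keeps each seen/unseen partition reproducible, so different algorithms can be evaluated on identical partitions.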

Algorithm	HMDB51 (26/25)	UCF101 (51/50)	UCF101 (81/20)
TARN	19.5±4.2	19.0±2.3	36.0±5.3
ARAAE (O)	17.9±1.2	15.3±2.2	33.1±5.6
ARAAE (C)	19.9±3.2	20.0±2.9	35.2±4.9
ARAAE (w/o ViT)	20.2±0.8	20.6±1.2	36.8±3.4
ARAAE (w/o E)	21.8±3.4	21.6±2.5	38.4±5.8
ARAAE (ours)	22.1±1.8	22.4±1.6	40.2±2.6
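To make the ablation easier to read, the gap between the full model and each variant can be computed directly from the mean accuracies in the table above. The short sketch below does this for the HMDB51 (26/25) column only. The variant labels are kept as printed, since this page does not expand them, and the key "ARAAE (full)" is an illustrative name for the last row.

```python
# Mean accuracies (%) on HMDB51 (26/25), copied from the ablation table above.
hmdb51_mean = {
    "ARAAE (O)": 17.9,
    "ARAAE (C)": 19.9,
    "ARAAE (w/o ViT)": 20.2,
    "ARAAE (w/o E)": 21.8,
    "ARAAE (full)": 22.1,
}


def gains_over_variants(means, full_key):
    """Return the full model's gain (percentage points) over every other variant."""
    full = means[full_key]
    return {
        name: round(full - value, 1)
        for name, value in means.items()
        if name != full_key
    }


gains = gains_over_variants(hmdb51_mean, "ARAAE (full)")
for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

These differences compare means only; the reported deviations (for example ±3.4 for the w/o E variant) are larger than some of the gaps, so small differences should be read with caution.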
