Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (1): 234-239. DOI: 10.11772/j.issn.1001-9081.2024010004

• Multimedia computing and computer simulation •

Action recognition algorithm based on attention mechanism and energy function

Lifang WANG1, Jingshuang WU1, Pengliang YIN2, Lihua HU1

  1. College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan Shanxi 030024, China
    2. Shanghai Freedom-Chips Microelectronics Company Limited, Xi’an Shaanxi 710000, China
  • Received: 2024-01-10  Revised: 2024-03-15  Accepted: 2024-03-21  Online: 2024-05-09  Published: 2025-01-10
  • Contact: Lifang WANG
  • About author: WANG Lifang, born in 1975, Ph. D., associate professor, CCF member. Her research interests include intelligent optimization and image processing. E-mail: wanglifang@tyust.edu.cn.
    WU Jingshuang, born in 1998, M. S. candidate. Her research interests include computer vision and action recognition.
    YIN Pengliang, born in 1987, M. S. His research interests include computer vision, deep learning, image processing and pattern recognition, and test and measurement methods and instruments.
    HU Lihua, born in 1982, Ph. D., professor, CCF member. Her research interests include computer vision.
  • Supported by:
    National Natural Science Foundation of China (62273248); Natural Science Foundation of Shanxi Province (202203021211206); Graduate Education Reform Project of Shanxi Province (2021YJJG238); Graduate Research Innovation Project of Shanxi Province (2023KY657); Taiyuan University of Science and Technology PhD Start-up Fund (20212021)


Abstract:

To address the lack of structural guidance in the frameworks of Zero-Shot Action Recognition (ZSAR) algorithms, an Action Recognition Algorithm based on Attention mechanism and Energy function (ARAAE) was proposed, with the Energy-Based Model (EBM) guiding the framework design. Firstly, to obtain the input for the EBM, a combination of optical flow and the Convolutional 3D (C3D) architecture was designed to extract visual features and remove spatial redundancy. Secondly, Vision Transformer (ViT) was used to extract visual features with reduced temporal redundancy, and ViT working together with the optical flow plus C3D combination further reduced spatial redundancy, yielding a non-redundant visual space. Finally, to measure the correlation between the visual space and the semantic space, an energy score evaluation mechanism was realized, and a joint loss function was designed for optimization. Experiments were conducted on the HMDB51 and UCF101 datasets against six classical ZSAR algorithms and algorithms from recent literature. The results show that on the HMDB51 dataset with average grouping, the average recognition accuracy of ARAAE is (22.1±1.8)%, better than those of CAGE (Coupling Adversarial Graph Embedding), Bi-dir GAN (Bi-directional Generative Adversarial Network) and ETSAN (Energy-based Temporal Summarized Attentive Network); on the UCF101 dataset with average grouping, the average recognition accuracy of ARAAE is (22.4±1.6)%, slightly better than those of all the comparison algorithms; on the UCF101 dataset with the 81/20 split, the average recognition accuracy of ARAAE is (40.2±2.6)%, higher than those of the comparison algorithms. These results indicate that ARAAE improves recognition performance in ZSAR effectively.
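To make the energy-scoring idea concrete, the following is a minimal illustrative sketch in PyTorch, not the authors' implementation: a small MLP assigns a scalar energy to a (visual, semantic) embedding pair, and a joint loss combines an energy-margin term with a softmax term over candidate classes. The embedding dimensions (512 visual, 300 semantic), the margin, the weight alpha, and the scorer architecture are all assumptions made for illustration; the visual features are assumed to come from an optical flow + C3D + ViT pipeline as described above.

# Minimal sketch (not the paper's code): energy-based compatibility between visual and
# semantic embeddings, optimized with a joint loss. All sizes and weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnergyScorer(nn.Module):
    """Maps a (visual, semantic) embedding pair to a scalar energy; lower = better match."""
    def __init__(self, vis_dim=512, sem_dim=300, hid_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim + sem_dim, hid_dim),
            nn.ReLU(),
            nn.Linear(hid_dim, 1),
        )

    def forward(self, vis, sem):
        # vis: (B, vis_dim) pooled visual features (assumed from optical flow + C3D + ViT)
        # sem: (B, sem_dim) semantic embeddings of action labels (e.g., word vectors)
        return self.mlp(torch.cat([vis, sem], dim=-1)).squeeze(-1)  # (B,)

def joint_loss(scorer, vis, sem_pos, sem_neg, margin=1.0, alpha=0.5):
    """Hinge-style energy margin (pull matched pairs down, push mismatched pairs up)
    plus a softmax term over {positive, negative} energies."""
    e_pos = scorer(vis, sem_pos)                      # energy of the correct class
    e_neg = scorer(vis, sem_neg)                      # energy of a wrong class
    margin_loss = F.relu(margin + e_pos - e_neg).mean()
    logits = torch.stack([-e_pos, -e_neg], dim=1)     # lower energy -> higher logit
    cls_loss = F.cross_entropy(logits, torch.zeros(vis.size(0), dtype=torch.long))
    return alpha * margin_loss + (1 - alpha) * cls_loss

# Toy usage with random tensors standing in for real features and label embeddings.
if __name__ == "__main__":
    scorer = EnergyScorer()
    vis = torch.randn(8, 512)
    sem_pos, sem_neg = torch.randn(8, 300), torch.randn(8, 300)
    loss = joint_loss(scorer, vis, sem_pos, sem_neg)
    loss.backward()
    print(float(loss))

At test time, a zero-shot prediction under this sketch would score an unseen video's visual embedding against the semantic embedding of every unseen class and pick the class with the lowest energy.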

Key words: Zero-Shot Action Recognition (ZSAR), energy function, attention mechanism, optical flow, visual feature


CLC Number: