Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (1): 234-239. DOI: 10.11772/j.issn.1001-9081.2024010004

• Multimedia computing and computer simulation •

Action recognition algorithm based on attention mechanism and energy function

Lifang WANG1, Jingshuang WU1, Pengliang YIN2, Lihua HU1

  1. College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan Shanxi 030024, China
    2. Shanghai Freedom-Chips Microelectronics Company Limited, Xi’an Shaanxi 710000, China
  • Received: 2024-01-10  Revised: 2024-03-15  Accepted: 2024-03-21  Online: 2024-05-09  Published: 2025-01-10
  • Contact: Lifang WANG
  • About author: WANG Lifang, born in 1975, Ph. D., associate professor, CCF member. Her research interests include intelligent optimization and image processing. E-mail: wanglifang@tyust.edu.cn.
    WU Jingshuang, born in 1998, M. S. candidate. Her research interests include computer vision and action recognition.
    YIN Pengliang, born in 1987, M. S. His research interests include computer vision, deep learning, image processing and pattern recognition, and test and measurement methods and instruments.
    HU Lihua, born in 1982, Ph. D., professor, CCF member. Her research interests include computer vision.
  • Supported by:
    National Natural Science Foundation of China (62273248); Natural Science Foundation of Shanxi Province (202203021211206); Graduate Education Reform Project of Shanxi Province (2021YJJG238); Graduate Research Innovation Project of Shanxi Province (2023KY657); Taiyuan University of Science and Technology PhD Start-up Fund (20212021)


Abstract:

To address the lack of structural guidance in the frameworks of Zero-Shot Action Recognition (ZSAR) algorithms, an Action Recognition Algorithm based on Attention mechanism and Energy function (ARAAE) was proposed, with the Energy-Based Model (EBM) guiding the framework design. Firstly, to obtain the input for the EBM, a combination of optical flow and the Convolutional 3D (C3D) architecture was designed to extract visual features and remove spatial redundancy. Secondly, Vision Transformer (ViT) was used to extract visual features with reduced temporal redundancy, and ViT working together with the optical flow plus C3D combination further reduced spatial redundancy, yielding a non-redundant visual space. Finally, to measure the correlation between the visual space and the semantic space, an energy score evaluation mechanism was realized, and a joint loss function was designed for optimization. Experiments were conducted on the HMDB51 and UCF101 datasets against six classical ZSAR algorithms and algorithms from recent literature. The results show that on the HMDB51 dataset with average grouping, the average recognition accuracy of ARAAE is (22.1±1.8)%, better than those of CAGE (Coupling Adversarial Graph Embedding), Bi-dir GAN (Bi-directional Generative Adversarial Network) and ETSAN (Energy-based Temporal Summarized Attentive Network); on the UCF101 dataset with average grouping, the average recognition accuracy of ARAAE is (22.4±1.6)%, slightly better than those of all the comparison algorithms; on the UCF101 dataset with the 81/20 split, the average recognition accuracy of ARAAE is (40.2±2.6)%, higher than those of the comparison algorithms. These results indicate that ARAAE improves recognition performance in ZSAR effectively.
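To make the energy-scoring idea concrete, the following is a minimal illustrative sketch in PyTorch, not the authors' implementation: a small MLP assigns a scalar energy to a (visual, semantic) embedding pair, and a joint loss combines an energy-margin term with a softmax term over candidate classes. The embedding dimensions (512 visual, 300 semantic), the margin, the weight alpha, and the scorer architecture are all assumptions made for illustration; the visual features are assumed to come from an optical flow + C3D + ViT pipeline as described above.

# Minimal sketch (not the paper's code): energy-based compatibility between visual and
# semantic embeddings, optimized with a joint loss. All sizes and weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnergyScorer(nn.Module):
    """Maps a (visual, semantic) embedding pair to a scalar energy; lower = better match."""
    def __init__(self, vis_dim=512, sem_dim=300, hid_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim + sem_dim, hid_dim),
            nn.ReLU(),
            nn.Linear(hid_dim, 1),
        )

    def forward(self, vis, sem):
        # vis: (B, vis_dim) pooled visual features (assumed from optical flow + C3D + ViT)
        # sem: (B, sem_dim) semantic embeddings of action labels (e.g., word vectors)
        return self.mlp(torch.cat([vis, sem], dim=-1)).squeeze(-1)  # (B,)

def joint_loss(scorer, vis, sem_pos, sem_neg, margin=1.0, alpha=0.5):
    """Hinge-style energy margin (pull matched pairs down, push mismatched pairs up)
    plus a softmax term over {positive, negative} energies."""
    e_pos = scorer(vis, sem_pos)                      # energy of the correct class
    e_neg = scorer(vis, sem_neg)                      # energy of a wrong class
    margin_loss = F.relu(margin + e_pos - e_neg).mean()
    logits = torch.stack([-e_pos, -e_neg], dim=1)     # lower energy -> higher logit
    cls_loss = F.cross_entropy(logits, torch.zeros(vis.size(0), dtype=torch.long))
    return alpha * margin_loss + (1 - alpha) * cls_loss

# Toy usage with random tensors standing in for real features and label embeddings.
if __name__ == "__main__":
    scorer = EnergyScorer()
    vis = torch.randn(8, 512)
    sem_pos, sem_neg = torch.randn(8, 300), torch.randn(8, 300)
    loss = joint_loss(scorer, vis, sem_pos, sem_neg)
    loss.backward()
    print(float(loss))

At test time, a zero-shot prediction under this sketch would score an unseen video's visual embedding against the semantic embedding of every unseen class and pick the class with the lowest energy.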

Key words: Zero-Shot Action Recognition (ZSAR), energy function, attention mechanism, optical flow, visual feature


CLC Number: