To address the lack of structural guidance in existing Zero-Shot Action Recognition (ZSAR) frameworks, an Action Recognition Algorithm based on Attention mechanism and Energy function (ARAAE) was proposed, with the Energy-Based Model (EBM) guiding the framework design. Firstly, to obtain the input for the EBM, a combination of optical flow and the Convolutional 3D (C3D) architecture was designed to extract visual features, achieving spatial non-redundancy. Secondly, Vision Transformer (ViT) was used for visual feature extraction to reduce temporal redundancy, and ViT in cooperation with the optical flow and C3D combination further reduced spatial redundancy, yielding a non-redundant visual space. Finally, to measure the correlation between the visual space and the semantic space, an energy score evaluation mechanism was realized, with a joint loss function designed for optimization. Experimental results on the HMDB51 and UCF101 datasets against six classical ZSAR algorithms and algorithms from recent literature show that, on HMDB51 with average grouping, ARAAE achieves an average recognition accuracy of (22.1±1.8)%, which is better than those of CAGE (Coupling Adversarial Graph Embedding), Bi-dir GAN (Bi-directional Generative Adversarial Network), and ETSAN (Energy-based Temporal Summarized Attentive Network). On UCF101 with average grouping, ARAAE achieves an average recognition accuracy of (22.4±1.6)%, slightly better than those of all comparison algorithms. On UCF101 with the 81/20 class split, ARAAE achieves an average recognition accuracy of (40.2±2.6)%, higher than those of the comparison algorithms. These results show that ARAAE effectively improves recognition performance in ZSAR.
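The energy score evaluation mechanism pairs a visual feature with each candidate class's semantic embedding and selects the class with the lowest energy. The following is only an illustrative sketch of that idea, not the paper's implementation: the bilinear compatibility matrix `W` and the function names `energy_score`/`predict` are assumptions chosen to show a common EBM scoring form.

```python
import numpy as np

def energy_score(visual, semantic, W):
    # Bilinear compatibility E(v, s) = -v^T W s: lower energy means a
    # better visual/semantic match. The bilinear form is an illustrative
    # assumption; the paper's exact scoring function is not given here.
    return -visual @ W @ semantic

def predict(visual, class_embeddings, W):
    # Zero-shot prediction: score the visual feature against every
    # unseen class's semantic embedding and return the index of the
    # class with minimum energy.
    scores = [energy_score(visual, s, W) for s in class_embeddings]
    return int(np.argmin(scores))

# Toy usage: two classes with orthogonal semantic embeddings.
W = np.eye(2)
classes = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(predict(np.array([1.0, 0.0]), classes, W))  # → 0
```

In practice `W` (or a deeper compatibility network) would be trained with a joint loss that pushes matched visual-semantic pairs toward low energy and mismatched pairs toward high energy, which is the optimization role the abstract assigns to the joint loss function.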