Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (2): 521-528.DOI: 10.11772/j.issn.1001-9081.2022010017

• Multimedia computing and computer simulation •

Action recognition method based on video spatio-temporal features

Ranyan NI, Yi ZHANG()   

  1. College of Computer Science,Sichuan University,Chengdu Sichuan 610065,China
  • Received:2022-01-07 Revised:2022-03-18 Accepted:2022-04-06 Online:2022-04-21 Published:2023-02-10
  • Contact: Yi ZHANG
  • About author: NI Ranyan, born in 1998 in Huangshan, Anhui, M.S. candidate. Her research interests include computer vision and action recognition.
  • Supported by:
    National Natural Science Foundation of China(U20A20161)



To address the problems that two-stream networks cannot be trained end-to-end because optical flow maps must be computed in advance to extract motion information, and that three-dimensional convolutional networks contain a large number of parameters, an action recognition method based on video spatio-temporal features was proposed. This method extracts the spatio-temporal information in videos efficiently without any optical flow calculation or three-dimensional convolution operation. Firstly, a motion information extraction module based on the attention mechanism was used to capture the motion shift information between adjacent frames, thereby simulating the role of optical flow in two-stream networks. Secondly, a decoupled spatio-temporal information extraction module was proposed to replace three-dimensional convolution for encoding spatio-temporal information. Finally, the two modules were embedded into a two-dimensional residual network to perform end-to-end action recognition. Experiments were carried out on several mainstream action recognition datasets. The results show that, using only RGB (Red-Green-Blue) video frames as input, the proposed method achieves recognition accuracies of 96.5%, 73.1% and 46.6% on the UCF101, HMDB51 and Something-Something-V1 datasets respectively. Compared with the Temporal Segment Network (TSN) method using a two-stream structure, the proposed method improves the recognition accuracy on UCF101 by 2.5 percentage points. These results indicate that the proposed method extracts spatio-temporal features in videos effectively.
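The abstract does not give implementation details of the motion information extraction module, but the core idea it describes can be illustrated with a minimal sketch: compute differences between adjacent frames and weight them with a soft attention mask so that regions with larger motion dominate, standing in for an optical flow field. The function name `motion_shift_features` and the per-pixel normalization used here are illustrative assumptions, not the authors' actual design.

```python
import numpy as np

def motion_shift_features(frames, eps=1e-8):
    """Hypothetical sketch of attention-weighted adjacent-frame
    differences as a stand-in for optical flow.

    frames: array of shape (T, H, W, C) holding T video frames.
    Returns an array of shape (T-1, H, W, C): one motion map per
    adjacent frame pair, scaled by a per-pixel attention mask.
    """
    # Raw motion signal: difference between each pair of adjacent frames.
    diffs = frames[1:] - frames[:-1]
    # Soft attention: pixels with larger change get weights closer to 1.
    magnitude = np.abs(diffs).sum(axis=-1, keepdims=True)
    attention = magnitude / (magnitude.max(axis=(1, 2), keepdims=True) + eps)
    return diffs * attention

# Toy input: 8 frames of a 16x16 RGB clip.
video = np.random.rand(8, 16, 16, 3).astype(np.float32)
motion = motion_shift_features(video)
print(motion.shape)  # (7, 16, 16, 3)
```

Because the module operates on pairs of RGB frames already inside the network, no optical flow needs to be precomputed, which is what allows the end-to-end training the abstract emphasizes.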

Key words: Convolutional Neural Network (CNN), action recognition, spatio-temporal information, temporal reasoning, motion information

