Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (2): 521-528. DOI: 10.11772/j.issn.1001-9081.2022010017

• Multimedia Computing and Computer Simulation •

Action recognition method based on video spatio-temporal features

Ranyan NI, Yi ZHANG

  1. College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
  • Received: 2022-01-07  Revised: 2022-03-18  Accepted: 2022-04-06  Online: 2022-04-21  Published: 2023-02-10
  • Contact: Yi ZHANG
  • About author: NI Ranyan, born in 1998 in Huangshan, Anhui, M. S. candidate. Her research interests include computer vision and action recognition.
  • Supported by:
    National Natural Science Foundation of China (U20A20161)

Abstract:

Two-stream networks cannot perform end-to-end recognition because optical flow maps have to be computed in advance to extract motion information, and three-dimensional convolutional networks contain a huge number of parameters. To address these problems, an action recognition method based on video spatio-temporal features was proposed. With this method, the spatio-temporal information in videos was extracted efficiently without any optical flow computation or three-dimensional convolution operation. Firstly, a motion information extraction module based on an attention mechanism was used to capture the motion displacement information between two adjacent frames, thereby simulating the role of optical flow maps in two-stream networks. Secondly, a decoupled spatio-temporal information extraction module was proposed to replace three-dimensional convolution and encode the spatio-temporal information. Finally, the two modules were embedded into a two-dimensional residual network to complete end-to-end action recognition. Experiments were carried out on several mainstream action recognition datasets. The results show that, with only RGB (Red-Green-Blue) video frames as input, the recognition accuracies of the proposed method on the UCF101, HMDB51 and Something-Something-V1 datasets reach 96.5%, 73.1% and 46.6% respectively; compared with the Temporal Segment Network (TSN) method, which uses a two-stream structure, the proposed method improves the recognition accuracy on UCF101 by 2.5 percentage points. These results indicate that the proposed method extracts spatio-temporal features in videos efficiently.
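
To make the pipeline described above concrete, the following is a minimal PyTorch sketch of the two components the abstract names: an attention-gated frame-difference module standing in for optical flow, and a decoupled spatial-then-temporal convolution standing in for three-dimensional convolution. The module names (MotionShiftAttention, DecoupledSpatioTemporal), the squeeze-and-excitation style gate and the depthwise temporal convolution are illustrative assumptions, not the paper's actual design.

import torch
import torch.nn as nn


class MotionShiftAttention(nn.Module):
    """Approximates the role of optical flow with attention-gated
    differences between adjacent frame features (hypothetical design)."""

    def __init__(self, channels: int, num_frames: int, reduction: int = 16):
        super().__init__()
        self.num_frames = num_frames
        squeezed = max(channels // reduction, 1)
        # Channel attention (squeeze-and-excitation style) over the
        # frame-difference features.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, squeezed, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(squeezed, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N*T, C, H, W); regroup to (N, T, C, H, W) to reach neighbours.
        nt, c, h, w = x.shape
        n = nt // self.num_frames
        feats = x.view(n, self.num_frames, c, h, w)
        # Displacement between frame t and t+1; zero-pad the last step.
        diff = feats[:, 1:] - feats[:, :-1]
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)
        diff = diff.reshape(nt, c, h, w)
        # Gate the motion features and add them back as a residual.
        return x + diff * self.gate(diff)


class DecoupledSpatioTemporal(nn.Module):
    """Replaces a 3D convolution with a 2D spatial convolution followed
    by a depthwise 1D temporal convolution (hypothetical design)."""

    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        self.num_frames = num_frames
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3,
                                  padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        n, t = nt // self.num_frames, self.num_frames
        x = self.spatial(x)
        # (N*T, C, H, W) -> (N*H*W, C, T) so the 1D conv runs along time.
        x = x.view(n, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(n * h * w, c, t)
        x = self.temporal(x)
        return x.view(n, h, w, c, t).permute(0, 4, 3, 1, 2).reshape(nt, c, h, w)


# Example: two 8-frame clips whose features keep their shape end to end.
frames = torch.randn(2 * 8, 64, 56, 56)
block = nn.Sequential(MotionShiftAttention(64, num_frames=8),
                      DecoupledSpatioTemporal(64, num_frames=8))
print(block(frames).shape)  # torch.Size([16, 64, 56, 56])

Both sketches preserve the (N*T, C, H, W) feature shape, which is what would allow them to be dropped into the residual branches of a standard two-dimensional ResNet, consistent with the end-to-end design the abstract describes.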

Key words: Convolutional Neural Network (CNN), action recognition, spatio-temporal information, temporal reasoning, motion information

CLC number: