Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (11): 3178-3183.DOI: 10.11772/j.issn.1001-9081.2020030399

• Artificial intelligence •

Human action recognition model based on tightly coupled spatiotemporal two-stream convolution neural network

LI Qian, YANG Wenzhu, CHEN Xiangyang, YUAN Tongtong, WANG Yuxia   

  1. School of Cyber Security and Computer, Hebei University, Baoding Hebei 071002, China
  • Received:2020-04-02 Revised:2020-06-19 Online:2020-11-10 Published:2020-07-01
  • Supported by:
    This work is partially supported by the Natural Science Foundation of Hebei Province (F2020201011).


  • Corresponding author: YANG Wenzhu (born 1968), male, from Baoding, Hebei, professor, Ph.D., CCF member; research interests: computer vision, intelligent systems.
  • About the authors: LI Qian (born 1994), male, from Zhumadian, Henan, M.S. candidate, CCF member; research interests: deep learning, action recognition. CHEN Xiangyang (born 1977), female, from Zhumadian, Henan, lecturer, M.S.; research interests: computer vision, action recognition. YUAN Tongtong (born 1994), female, from Baoding, Hebei, M.S. candidate; research interests: deep learning, object tracking. WANG Yuxia (born 1994), female, from Handan, Hebei, M.S. candidate; research interests: deep learning, object detection.
Abstract: To address the low utilization of action information and the insufficient attention paid to temporal information in video-based human action recognition, a human action recognition model based on a tightly coupled spatiotemporal two-stream convolutional neural network was proposed. Firstly, two 2D convolutional neural networks were used to extract the spatial and temporal features of the video separately. Then, the forget gate module of the Long Short-Term Memory (LSTM) network was used to establish tightly coupled feature-level connections between the sampled segments, enabling the transfer of the information flow. Next, a Bi-directional Long Short-Term Memory (Bi-LSTM) network was used to evaluate the importance of each sampled segment and assign it an adaptive weight. Finally, the spatiotemporal two-stream features were fused to complete the human action recognition. In experiments on the UCF101 and HMDB51 datasets, the proposed model achieved accuracies of 94.2% and 70.1% respectively. The experimental results show that the model effectively improves the utilization of temporal information and the representation of the overall action, thereby significantly improving the accuracy of human action recognition.
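The pipeline described in the abstract can be sketched as follows. This is a minimal, illustrative NumPy sketch only, not the authors' implementation: the CNN outputs are replaced by random per-segment feature vectors, the gate parameters are random rather than learned, and the Bi-LSTM importance estimator is stood in for by a simple softmax-normalized scoring function. All names (`couple_segments`, `adaptive_weights`, `stream_descriptor`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical per-segment features: T sampled segments, D-dim each,
# standing in for the outputs of the spatial and temporal 2D CNNs.
T, D = 4, 8
spatial = rng.standard_normal((T, D))
temporal = rng.standard_normal((T, D))

# Illustrative forget-gate parameters (random here; learned in the paper).
W_f = rng.standard_normal((D, 2 * D)) * 0.1
b_f = np.zeros(D)

def couple_segments(feats):
    """Tightly couple consecutive segments through a forget gate:
    each segment retains a gated portion of the previous coupled
    feature, mimicking the LSTM forget-gate information flow."""
    coupled = [feats[0]]
    for t in range(1, len(feats)):
        f = sigmoid(W_f @ np.concatenate([coupled[-1], feats[t]]) + b_f)
        coupled.append(f * coupled[-1] + feats[t])
    return np.stack(coupled)

def adaptive_weights(feats):
    """Stand-in for the Bi-LSTM importance estimator: score each
    segment, then softmax-normalize into adaptive weights."""
    scores = feats.sum(axis=1)            # placeholder scoring function
    e = np.exp(scores - scores.max())
    return e / e.sum()

def stream_descriptor(feats):
    coupled = couple_segments(feats)
    w = adaptive_weights(coupled)
    return (w[:, None] * coupled).sum(axis=0)  # weighted fusion over segments

# Fuse the two streams into one spatiotemporal action descriptor.
video_feature = np.concatenate([stream_descriptor(spatial),
                                stream_descriptor(temporal)])
print(video_feature.shape)  # (16,)
```

In a real model the weighted per-stream descriptors would feed a classifier; here the sketch only shows how forget-gate coupling and adaptive segment weighting combine T segment features into a single fused vector per stream.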

Key words: human action recognition, spatiotemporal model, Convolutional Neural Network (CNN), forget gate, feature fusion

