Action recognition method based on video spatio-temporal features

doi:10.11772/j.issn.1001-9081.2022010017

Abstract

Abstract:

Aiming at the problems that the end-to-end recognition of two-stream networks cannot be realized due to the need of calculating optical flow maps in advance to extract motion information and the three-dimensional convolutional networks have a lot of parameters， an action recognition method based on video spatio-temporal features was proposed. In this method， the spatio-temporal information in videos were able to be extracted efficiently without adding any calculation of optical flows or any three-dimensional convolution operation. Firstly， the motion information extraction module based on attention mechanism was used to capture the motion shift information between two adjacent frames， thereby simulating the function of optical flows in two-stream network. Secondly， a decoupled spatio-temporal information extraction module was proposed to replace the three-dimensional convolution in order to encode the spatio-temporal information. Finally， the two modules were embedded into the two-dimensional residual network to complete the end-to-end action recognition. Experiments were carried out on several mainstream action recognition datasets. The results show that when only using RGB （Red-Green-Blue） video frames as input， the recognition accuracies of the proposed method on UCF101， HMDB51 and Something-Something-V1 datasets are 96.5%， 73.1% and 46.6% respectively. Compared with Temporal Segment Network （TSN） method using two-stream structure， the proposed method has the recognition accuracy on UCF101 improved by 2.5 percentage points. It can be seen that the proposed method is able to extract spatio-temporal features in videos efficiently.

Key words: Convolutional Neural Network (CNN), action recognition, spatio-temporal information, temporal reasoning, motion information

摘要：

针对双流网络提取运动信息需要预先计算光流图，从而无法实现端到端的识别以及三维卷积网络参数量巨大的问题，提出了一种基于视频时空特征的行为识别方法。该方法能够高效提取视频中的时空信息，且无需添加任何光流计算和三维卷积操作。首先，利用基于注意力机制的运动信息提取模块捕获相邻两帧之间的运动位移信息，从而模拟双流网络中光流图的作用；其次，提出了一种解耦的时空信息提取模块代替三维卷积，从而实现时空信息的编码；最后，在将两个模块嵌入二维的残差网络中后，完成端到端的行为识别。将所提方法在几个主流的行为识别数据集上进行实验，结果表明在仅使用RGB视频帧作为输入的情况下，在UCF101、HMDB51、Something-Something-V1数据集上的识别准确率分别为96.5%、73.1%和46.6%，与使用双流结构的时间分段网络（TSN）方法相比，在UCF101数据集上的识别准确率提高了2.5个百分点。可见，所提方法能够高效提取视频中的时空特征。

关键词: 卷积神经网络, 行为识别, 时空信息, 时序推理, 运动信息

CLC Number:

TP391.4

Ranyan NI, Yi ZHANG. Action recognition method based on video spatio-temporal features[J]. Journal of Computer Applications, 2023, 43(2): 521-528.

倪苒岩, 张轶. 基于视频时空特征的行为识别方法[J]. 《计算机应用》唯一官方网站, 2023, 43(2): 521-528.

Figures/Tables 10

References 35

1	WANG H， KLÄSER A， SCHMID C， et al. Dense trajectories and motion boundary descriptors for action recognition［J］. International Journal of Computer Vision， 2013， 103（1）： 60-79. 10.1007/s11263-012-0594-8
2	王萍，庞文浩. 基于视频分段的空时双通道卷积神经网络的行为识别［J］. 计算机应用， 2019， 39（7）：2081-2086. 10.11772/j.issn.1001-9081.2019010156
	WANG P， PANG W H. Two-stream CNN for action recognition based on video segmentation［J］. Journal of Computer Applications， 2019， 39（7）： 2081-2086. 10.11772/j.issn.1001-9081.2019010156
3	KLÄSER A， MARSZAŁEK M， SCHMID C. A spatio-temporal descriptor based on 3D-gradients［C］// Proceedings of the 2008 British Machine Vision Conference. Durham： BMVA Press， 2008： No.99. 10.5244/c.22.99
4	郭明祥，宋全军，徐湛楠，等. 基于三维残差稠密网络的人体行为识别算法［J］. 计算机应用， 2019， 39（12）：3482-3489. 10.11772/j.issn.1001-9081.2019061056
	GUO M X， SONG Q J， XU Z N， et al. Human behavior recognition algorithm based on three-dimensional residual dense network［J］. Journal of Computer Applications， 2019， 39（12）：3482-3489. 10.11772/j.issn.1001-9081.2019061056
5	SIMONYAN K， ZISSERMAN A. Two-stream convolutional networks for action recognition in videos［C］// Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1. Cambridge： MIT Press， 2014：568-576.
6	TRAN D， BOURDEV L， FERGUS R， et al. Learning spatiotemporal features with 3D convolutional networks［C］// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway： IEEE， 2015：4489-4497. 10.1109/iccv.2015.510
7	LIN J， GAN C， HAN S. TSM： temporal shift module for efficient video understanding［C］// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2019：7082-7092. 10.1109/iccv.2019.00718
8	WANG L M， XIONG Y J， WANG Z， et al. Temporal segment networks： towards good practices for deep action recognition［C］// Proceedings of the 2016 European Conference on Computer Vision， LNCS 9912. Cham： Springer， 2016： 20-36.
9	LAN Z Z， ZHU Y， HAUPTMANN A G， et al. Deep local video feature for action recognition［C］// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Piscataway： IEEE， 2017： 1219-1225. 10.1109/cvprw.2017.161
10	LIN W Y， MI Y， WU J X， et al. Action recognition with coarse-to-fine deep feature integration and asynchronous fusion［C］// Proceedings of the 32nd AAAI Conference on Artificial Intelligence. Palo Alto， CA： AAAI Press， 2018：7130-7137. 10.1609/aaai.v32i1.12232
11	JI S W， XU W， YANG M， et al. 3D convolutional neural networks for human action recognition［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2013， 35（1）：221-231. 10.1109/tpami.2012.59
12	TRAN D， RAY J， SHOU Z， et al. ConvNet architecture search for spatio-temporal feature learning［EB/OL］. （2017-08-16）［2021-12-26］..
13	CAI J H， HU J G. 3D RANs： 3D residual attention networks for action recognition［J］. The Visual Computer， 2020， 36（6）： 1261-1270. 10.1007/s00371-019-01733-3
14	ZOLFAGHARI M， SINGH K， BROX T. ECO： efficient convolutional network for online video understanding［C］// Proceedings of the 2018 European Conference on Computer Vision， LNCS 11206. Cham： Springer， 2018： 713-730.
15	LEE M， LEE S， SON S， et al. Motion feature network： fixed motion filter for action recognition［C］// Proceedings of the 2018 European Conference on Computer Vision， LNCS 11214. Cham： Springer， 2018： 392-408.
16	HE K M， ZHANG X Y， REN S Q， et al. Deep residual learning for image recognition［C］// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2016： 770-778. 10.1109/cvpr.2016.90
17	DENG J， DONG W， SOCHER R， et al. ImageNet： a large-scale hierarchical image database［C］// Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2009： 248-255. 10.1109/cvpr.2009.5206848
18	HU J， SHEN L， SUN G. Squeeze-and-excitation networks［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 7132-7141. 10.1109/cvpr.2018.00745
19	SOOMRO K， ZAMIR A R， SHAH M. UCF101： a dataset of 101 human actions classes from videos in the wild［EB/OL］. ［2021-12-26］..
20	KUEHNE H， JHUANG H， GARROTE E， et al. HMDB： a large video database for human motion recognition［C］// Proceedings of the 2011 IEEE International Conference on Computer Vision. Piscataway： IEEE， 2011：2556-2563. 10.1109/iccv.2011.6126543
21	GOYAL R， KAHOU S E， MICHALSKI V， et al. The "something something" video database for learning and evaluating visual common sense［C］// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway： IEEE， 2017： 5843-5851. 10.1109/iccv.2017.622
22	TRAN A， CHEONG L F. Two-stream flow-guided convolutional attention networks for action recognition［C］// Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops. Piscataway： IEEE， 2017：3110-3119. 10.1109/iccvw.2017.368
23	QIU Z F， YAO T， MEI T. Learning spatio-temporal representation with pseudo-3D residual networks［C］// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway： IEEE， 2017： 5534-5542. 10.1109/iccv.2017.590
24	DIBA， FAYYAZ M， SHARMA V， et al. Temporal 3D ConvNets： new architecture and transfer learning for video classification［EB/OL］. （2017-11-22）［2021-12-26］..
25	KAZAKOS E， NAGRANI A， ZISSERMAN A， et al. EPIC-fusion： audio-visual temporal binding for egocentric action recognition［C］// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2019： 5491-5500. 10.1109/iccv.2019.00559
26	WANG L M， LI W， LI W， et al. Appearance-and-relation networks for video classification［C］// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2018： 1430-1439. 10.1109/cvpr.2018.00155
27	LI X Y， SHUAI B， TIGHE J. Directional temporal modeling for action recognition［C］// Proceedings of the 2020 European Conference on Computer Vision， LNCS 12351. Cham： Springer， 2020：275-291.
28	KUMAWAT S， VERMA M， NAKASHIMA Y， et al. Depthwise spatio-temporal STFT convolutional neural networks for human action recognition［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2022， 44（9）：4839-4851.
29	SAHOO S P， ARI S， MAHAPATRA K， et al. HAR-Depth： a novel framework for human action recognition using sequential learning and depth estimated history images［J］. IEEE Transactions on Emerging Topics in Computational Intelligence， 2021， 5（5）： 813-825. 10.1109/tetci.2020.3014367
30	ZHANG J X， HU H F， LIU Z. Appearance-and-dynamic learning with bifurcated convolution neural network for action recognition［J］. IEEE Transactions on Circuits and Systems for Video Technology， 2021， 31（4）： 1593-1606. 10.1109/tcsvt.2020.3006223
31	BAI S K， WANG Q， LI X L. MFI： multi-range feature interchange for video action recognition［C］// Proceedings of the 25th International Conference on Pattern Recognition. Piscataway： IEEE， 2021： 6664-6671. 10.1109/icpr48806.2021.9412124
32	CARREIRA J， ZISSERMAN A. Quo vadis， action recognition？ a new model and the kinetics dataset［C］// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway： IEEE， 2017： 4724-4733. 10.1109/cvpr.2017.502
33	ZHOU B L， ANDONIAN A， OLIVA A， et al. Temporal relational reasoning in videos［C］// Proceedings of the 2018 European Conference on Computer Vision， LNCS 11205. Cham： Springer， 2018： 831-846.
34	LIU Z Y， WANG L M， WU W， et al. TAM： temporal adaptive module for video recognition［C］// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway： IEEE， 2021：13688-13698. 10.1109/iccv48922.2021.01345
35	SELVARAJU R R， COGSWELL M， DAS A， et al. Grad-CAM： visual explanations from deep networks via gradient-based localization［C］// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway： IEEE， 2017： 618-626. 10.1109/iccv.2017.74

数据集	方法	年份	准确率/%
UCF101	文献［5］方法	2014	88.8
	文献［6］方法	2015	82.3
	文献［7］方法	2019	95.9
	文献［8］方法	2016	94.0
	文献［22］方法	2017	92.0
	文献［23］方法	2017	88.6
	文献［24］方法	2017	93.2
	文献［25］方法	2019	93.6
	文献［26］方法	2018	94.3
	文献［27］方法	2020	95.6
	文献［28］方法	2020	94.7
	文献［29］方法	2020	93.0
	文献［30］方法	2020	94.9
	文献［31］方法	2021	95.6
	本文方法	2022	96.5
HMDB51	文献［5］方法	2014	59.4
	文献［6］方法	2015	56.8
	文献［7］方法	2019	70.7
	文献［8］方法	2016	68.5
	文献［14］方法	2018	68.3
	文献［24］方法	2017	59.2
	文献［25］方法	2019	69.4
	文献［28］方法	2020	71.5
	文献［29］方法	2020	69.7
	文献［30］方法	2020	72.1
	本文方法	2022	73.1
Something-Something-V1	文献［7］方法	2019	45.6
	文献［8］方法	2016	19.5
	文献［14］方法	2018	39.6
	文献［31］方法	2021	43.9
	文献［32］方法	2017	41.6
	文献［33］方法	2018	34.4
	文献［34］方法	2020	46.5
	本文方法	2022	46.6

数据集	方法	年份	准确率/%
UCF101	文献［5］方法	2014	88.8
	文献［6］方法	2015	82.3
	文献［7］方法	2019	95.9
	文献［8］方法	2016	94.0
	文献［22］方法	2017	92.0
	文献［23］方法	2017	88.6
	文献［24］方法	2017	93.2
	文献［25］方法	2019	93.6
	文献［26］方法	2018	94.3
	文献［27］方法	2020	95.6
	文献［28］方法	2020	94.7
	文献［29］方法	2020	93.0
	文献［30］方法	2020	94.9
	文献［31］方法	2021	95.6
	本文方法	2022	96.5
HMDB51	文献［5］方法	2014	59.4
	文献［6］方法	2015	56.8
	文献［7］方法	2019	70.7
	文献［8］方法	2016	68.5
	文献［14］方法	2018	68.3
	文献［24］方法	2017	59.2
	文献［25］方法	2019	69.4
	文献［28］方法	2020	71.5
	文献［29］方法	2020	69.7
	文献［30］方法	2020	72.1
	本文方法	2022	73.1
Something-Something-V1	文献［7］方法	2019	45.6
	文献［8］方法	2016	19.5
	文献［14］方法	2018	39.6
	文献［31］方法	2021	43.9
	文献［32］方法	2017	41.6
	文献［33］方法	2018	34.4
	文献［34］方法	2020	46.5
	本文方法	2022	46.6

方法	采样帧数	参数量/10⁶	浮点运算量/GFLOPs	准确率/%
文献［7］方法	8	24.3	33	45.6
文献［8］方法	8	10.7	16	19.5
文献［14］方法	8	47.5	32	39.6
文献［31］方法	8	24.6	34	43.9
文献［32］方法	32	28.0	153	41.6
文献［33］方法	8	18.3	33	34.4
文献［34］方法	8	25.6	33	46.5
本文方法	8	25.7	34	46.6

方法	采样帧数	参数量/10⁶	浮点运算量/GFLOPs	准确率/%
文献［7］方法	8	24.3	33	45.6
文献［8］方法	8	10.7	16	19.5
文献［14］方法	8	47.5	32	39.6
文献［31］方法	8	24.6	34	43.9
文献［32］方法	32	28.0	153	41.6
文献［33］方法	8	18.3	33	34.4
文献［34］方法	8	25.6	33	46.5
本文方法	8	25.7	34	46.6

方法	准确率/%
Baseline	45.6
Baseline+运动信息提取模块	46.0
Baseline+时空信息提取模块	45.9
Baseline+运动信息提取模块+时空信息提取模块	46.6