Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (2): 521-528. DOI: 10.11772/j.issn.1001-9081.2022010017
• Multimedia Computing and Computer Simulation •
Received: 2022-01-07
Revised: 2022-03-18
Accepted: 2022-04-06
Online: 2022-04-21
Published: 2023-02-10
Contact: Yi ZHANG
About author: NI Ranyan, born in 1998 in Huangshan, Anhui, M. S. candidate. Her research interests include computer vision and action recognition.
Abstract: Two-stream networks need to pre-compute optical flow maps to extract motion information, which prevents end-to-end recognition, while 3D convolutional networks suffer from a huge number of parameters. To address these problems, an action recognition method based on video spatio-temporal features was proposed. The method extracts spatio-temporal information from videos efficiently without adding any optical flow computation or 3D convolution operation. First, a motion information extraction module based on the attention mechanism was used to capture the motion displacement information between two adjacent frames, simulating the role of optical flow maps in two-stream networks. Second, a decoupled spatio-temporal information extraction module was proposed to replace 3D convolution and encode the spatio-temporal information. Finally, the two modules were embedded into a 2D residual network to perform end-to-end action recognition. Experimental results on several mainstream action recognition datasets show that, with only RGB video frames as input, the proposed method achieves recognition accuracies of 96.5%, 73.1% and 46.6% on the UCF101, HMDB51 and Something-Something-V1 datasets respectively; compared with the Temporal Segment Network (TSN) method with a two-stream structure, it improves the recognition accuracy on UCF101 by 2.5 percentage points. These results demonstrate that the proposed method can extract spatio-temporal features from videos efficiently.
Ranyan NI, Yi ZHANG. Action recognition method based on video spatio-temporal features[J]. Journal of Computer Applications, 2023, 43(2): 521-528.
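This page gives only the high-level design of the two modules. The sketch below is a minimal PyTorch illustration of the general idea, not the authors' implementation: an attention map built from adjacent-frame feature differences stands in for optical flow, and a spatial 2D convolution followed by a temporal 1D convolution stands in for full 3D convolution. The class names, channel-reduction ratio, depthwise kernels and tensor layout are all illustrative assumptions.

```python
# Minimal sketch of the two module ideas described in the abstract.
# All names and hyperparameters are illustrative assumptions.
# Tensors follow the common (N*T, C, H, W) layout of 2D-CNN video models.
import torch
import torch.nn as nn


class MotionExtraction(nn.Module):
    """Attention-style motion cue from adjacent-frame feature differences
    (a stand-in for optical flow; not the authors' exact module)."""

    def __init__(self, channels: int, n_segment: int = 8, reduction: int = 16):
        super().__init__()
        self.n_segment = n_segment
        self.squeeze = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.expand = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        n = nt // self.n_segment
        feat = self.squeeze(x).view(n, self.n_segment, -1, h, w)
        # Displacement proxy: difference between each frame and its successor;
        # the last frame has no successor, so its difference stays zero.
        diff = torch.zeros_like(feat)
        diff[:, :-1] = feat[:, 1:] - feat[:, :-1]
        attn = torch.sigmoid(self.expand(diff.view(nt, -1, h, w)))
        return x * attn  # reweight appearance features by motion saliency


class DecoupledSpatioTemporal(nn.Module):
    """(2+1)D-style factorization: a depthwise spatial 2D convolution followed
    by a depthwise temporal 1D convolution, instead of full 3D convolution."""

    def __init__(self, channels: int, n_segment: int = 8):
        super().__init__()
        self.n_segment = n_segment
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.temporal = nn.Conv1d(channels, channels, 3, padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        n = nt // self.n_segment
        y = self.spatial(x)
        # Fold space into the batch axis and convolve along the time axis.
        y = y.view(n, self.n_segment, c, h * w).permute(0, 3, 2, 1)
        y = y.reshape(n * h * w, c, self.n_segment)
        y = self.temporal(y)
        y = y.view(n, h * w, c, self.n_segment).permute(0, 3, 2, 1)
        return y.reshape(nt, c, h, w) + x  # residual connection


if __name__ == "__main__":
    # Two clips of 8 sampled frames with 64-channel feature maps.
    x = torch.randn(2 * 8, 64, 56, 56)
    x = MotionExtraction(64)(x)
    x = DecoupledSpatioTemporal(64)(x)
    print(x.shape)  # torch.Size([16, 64, 56, 56])
```

Making both convolutions depthwise is one plausible way to keep the cost near a 2D baseline, consistent with the small overhead reported in Tab. 2 (25.7×10⁶ parameters and 34 GFLOPs at 8 sampled frames).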
Tab. 1 Comparison of different methods on three datasets

| Dataset | Method | Year | Accuracy/% |
| --- | --- | --- | --- |
| UCF101 | Ref. [ ] | 2014 | 88.8 |
| | Ref. [ ] | 2015 | 82.3 |
| | Ref. [ ] | 2019 | 95.9 |
| | Ref. [ ] | 2016 | 94.0 |
| | Ref. [ ] | 2017 | 92.0 |
| | Ref. [ ] | 2017 | 88.6 |
| | Ref. [ ] | 2017 | 93.2 |
| | Ref. [ ] | 2019 | 93.6 |
| | Ref. [ ] | 2018 | 94.3 |
| | Ref. [ ] | 2020 | 95.6 |
| | Ref. [ ] | 2020 | 94.7 |
| | Ref. [ ] | 2020 | 93.0 |
| | Ref. [ ] | 2020 | 94.9 |
| | Ref. [ ] | 2021 | 95.6 |
| | Proposed method | 2022 | 96.5 |
| HMDB51 | Ref. [ ] | 2014 | 59.4 |
| | Ref. [ ] | 2015 | 56.8 |
| | Ref. [ ] | 2019 | 70.7 |
| | Ref. [ ] | 2016 | 68.5 |
| | Ref. [ ] | 2018 | 68.3 |
| | Ref. [ ] | 2017 | 59.2 |
| | Ref. [ ] | 2019 | 69.4 |
| | Ref. [ ] | 2020 | 71.5 |
| | Ref. [ ] | 2020 | 69.7 |
| | Ref. [ ] | 2020 | 72.1 |
| | Proposed method | 2022 | 73.1 |
| Something-Something-V1 | Ref. [ ] | 2019 | 45.6 |
| | Ref. [ ] | 2016 | 19.5 |
| | Ref. [ ] | 2018 | 39.6 |
| | Ref. [ ] | 2021 | 43.9 |
| | Ref. [ ] | 2017 | 41.6 |
| | Ref. [ ] | 2018 | 34.4 |
| | Ref. [ ] | 2020 | 46.5 |
| | Proposed method | 2022 | 46.6 |
Tab. 2 Comparison of sampling frames, parameters, FLOPs and accuracy among different methods on Something-Something-V1 dataset

| Method | Sampled frames | Parameters/10⁶ | FLOPs/GFLOPs | Accuracy/% |
| --- | --- | --- | --- | --- |
| Ref. [ ] | 8 | 24.3 | 33 | 45.6 |
| Ref. [ ] | 8 | 10.7 | 16 | 19.5 |
| Ref. [ ] | 8 | 47.5 | 32 | 39.6 |
| Ref. [ ] | 8 | 24.6 | 34 | 43.9 |
| Ref. [ ] | 32 | 28.0 | 153 | 41.6 |
| Ref. [ ] | 8 | 18.3 | 33 | 34.4 |
| Ref. [ ] | 8 | 25.6 | 33 | 46.5 |
| Proposed method | 8 | 25.7 | 34 | 46.6 |
Tab. 3 Influence of different modules on network

| Method | Accuracy/% |
| --- | --- |
| Baseline | 45.6 |
| Baseline + motion information extraction module | 46.0 |
| Baseline + spatio-temporal information extraction module | 45.9 |
| Baseline + motion information extraction module + spatio-temporal information extraction module | 46.6 |
Fig. 6 Confusion matrices of the proposed method on the UCF101, HMDB51 and Something-Something-V1 datasets
1. WANG H, KLÄSER A, SCHMID C, et al. Dense trajectories and motion boundary descriptors for action recognition[J]. International Journal of Computer Vision, 2013, 103(1): 60-79. DOI: 10.1007/s11263-012-0594-8.
2. WANG P, PANG W H. Two-stream CNN for action recognition based on video segmentation[J]. Journal of Computer Applications, 2019, 39(7): 2081-2086. DOI: 10.11772/j.issn.1001-9081.2019010156.
3. KLÄSER A, MARSZAŁEK M, SCHMID C. A spatio-temporal descriptor based on 3D-gradients[C]// Proceedings of the 2008 British Machine Vision Conference. Durham: BMVA Press, 2008: No.99. DOI: 10.5244/c.22.99.
4. GUO M X, SONG Q J, XU Z N, et al. Human behavior recognition algorithm based on three-dimensional residual dense network[J]. Journal of Computer Applications, 2019, 39(12): 3482-3489. DOI: 10.11772/j.issn.1001-9081.2019061056.
5. SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]// Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1. Cambridge: MIT Press, 2014: 568-576.
6. TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 4489-4497. DOI: 10.1109/iccv.2015.510.
7. LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 7082-7092. DOI: 10.1109/iccv.2019.00718.
8. WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9912. Cham: Springer, 2016: 20-36.
9. LAN Z Z, ZHU Y, HAUPTMANN A G, et al. Deep local video feature for action recognition[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Piscataway: IEEE, 2017: 1219-1225. DOI: 10.1109/cvprw.2017.161.
10. LIN W Y, MI Y, WU J X, et al. Action recognition with coarse-to-fine deep feature integration and asynchronous fusion[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2018: 7130-7137. DOI: 10.1609/aaai.v32i1.12232.
11. JI S W, XU W, YANG M, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221-231. DOI: 10.1109/tpami.2012.59.
12. TRAN D, RAY J, SHOU Z, et al. ConvNet architecture search for spatio-temporal feature learning[EB/OL]. (2017-08-16) [2021-12-26].
13. CAI J H, HU J G. 3D RANs: 3D residual attention networks for action recognition[J]. The Visual Computer, 2020, 36(6): 1261-1270. DOI: 10.1007/s00371-019-01733-3.
14. ZOLFAGHARI M, SINGH K, BROX T. ECO: efficient convolutional network for online video understanding[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11206. Cham: Springer, 2018: 713-730.
15. LEE M, LEE S, SON S, et al. Motion feature network: fixed motion filter for action recognition[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11214. Cham: Springer, 2018: 392-408.
16. HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. DOI: 10.1109/cvpr.2016.90.
17. DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]// Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2009: 248-255. DOI: 10.1109/cvpr.2009.5206848.
18. HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7132-7141. DOI: 10.1109/cvpr.2018.00745.
19. SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[EB/OL]. [2021-12-26].
20. KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: a large video database for human motion recognition[C]// Proceedings of the 2011 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2011: 2556-2563. DOI: 10.1109/iccv.2011.6126543.
21. GOYAL R, KAHOU S E, MICHALSKI V, et al. The "something something" video database for learning and evaluating visual common sense[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5843-5851. DOI: 10.1109/iccv.2017.622.
22. TRAN A, CHEONG L F. Two-stream flow-guided convolutional attention networks for action recognition[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops. Piscataway: IEEE, 2017: 3110-3119. DOI: 10.1109/iccvw.2017.368.
23. QIU Z F, YAO T, MEI T. Learning spatio-temporal representation with pseudo-3D residual networks[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5534-5542. DOI: 10.1109/iccv.2017.590.
24. DIBA A, FAYYAZ M, SHARMA V, et al. Temporal 3D ConvNets: new architecture and transfer learning for video classification[EB/OL]. (2017-11-22) [2021-12-26].
25. KAZAKOS E, NAGRANI A, ZISSERMAN A, et al. EPIC-fusion: audio-visual temporal binding for egocentric action recognition[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 5491-5500. DOI: 10.1109/iccv.2019.00559.
26. WANG L M, LI W, LI W, et al. Appearance-and-relation networks for video classification[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 1430-1439. DOI: 10.1109/cvpr.2018.00155.
27. LI X Y, SHUAI B, TIGHE J. Directional temporal modeling for action recognition[C]// Proceedings of the 2020 European Conference on Computer Vision, LNCS 12351. Cham: Springer, 2020: 275-291.
28. KUMAWAT S, VERMA M, NAKASHIMA Y, et al. Depthwise spatio-temporal STFT convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(9): 4839-4851.
29. SAHOO S P, ARI S, MAHAPATRA K, et al. HAR-Depth: a novel framework for human action recognition using sequential learning and depth estimated history images[J]. IEEE Transactions on Emerging Topics in Computational Intelligence, 2021, 5(5): 813-825. DOI: 10.1109/tetci.2020.3014367.
30. ZHANG J X, HU H F, LIU Z. Appearance-and-dynamic learning with bifurcated convolution neural network for action recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(4): 1593-1606. DOI: 10.1109/tcsvt.2020.3006223.
31. BAI S K, WANG Q, LI X L. MFI: multi-range feature interchange for video action recognition[C]// Proceedings of the 25th International Conference on Pattern Recognition. Piscataway: IEEE, 2021: 6664-6671. DOI: 10.1109/icpr48806.2021.9412124.
32. CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the kinetics dataset[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 4724-4733. DOI: 10.1109/cvpr.2017.502.
33. ZHOU B L, ANDONIAN A, OLIVA A, et al. Temporal relational reasoning in videos[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11205. Cham: Springer, 2018: 831-846.
34. LIU Z Y, WANG L M, WU W, et al. TAM: temporal adaptive module for video recognition[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 13688-13698. DOI: 10.1109/iccv48922.2021.01345.
35. SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 618-626. DOI: 10.1109/iccv.2017.74.