Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (2): 521-528. DOI: 10.11772/j.issn.1001-9081.2022010017
Special topic: Multimedia computing and computer simulation
Received:
2022-01-07
Revised:
2022-03-18
Accepted:
2022-04-06
Online:
2022-04-21
Published:
2023-02-10
Contact:
Yi ZHANG
About author:
NI Ranyan, born in 1998, M.S. candidate. Her research interests include computer vision and action recognition.
Abstract:
Two-stream networks must pre-compute optical flow maps to extract motion information, which prevents end-to-end recognition, while 3D convolutional networks suffer from a huge number of parameters. To address these problems, an action recognition method based on video spatio-temporal features was proposed. The method extracts spatio-temporal information from videos efficiently, without adding any optical flow computation or 3D convolution. First, a motion information extraction module based on an attention mechanism was used to capture the motion displacement between adjacent frames, thereby simulating the role of optical flow maps in two-stream networks. Second, a decoupled spatio-temporal information extraction module was proposed to replace 3D convolution and encode spatio-temporal information. Finally, the two modules were embedded into a 2D residual network to perform end-to-end action recognition. Experiments on several mainstream action recognition datasets show that, using only RGB video frames as input, the proposed method achieves recognition accuracies of 96.5%, 73.1% and 46.6% on UCF101, HMDB51 and Something-Something-V1 respectively; compared with the Temporal Segment Network (TSN) method with a two-stream structure, the accuracy on UCF101 is improved by 2.5 percentage points. These results indicate that the proposed method can extract spatio-temporal features from videos efficiently.
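The two modules described in the abstract can be made concrete with a short sketch. The paper does not publish source code, so the following PyTorch rendering is a minimal, hypothetical illustration of the two ideas only: an attention-based module that gates features with adjacent-frame differences (standing in for optical flow), and a decoupled block that factorizes 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution. All class names, channel reductions and kernel sizes are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two modules described in the abstract.
# NOT the authors' code: names, channel sizes and the exact attention
# form are assumptions; only the high-level ideas (frame differencing
# with attention, and (2+1)D factorized convolution) come from the text.
import torch
import torch.nn as nn


class MotionExtraction(nn.Module):
    """Approximates optical flow by attending to differences of adjacent frames."""

    def __init__(self, channels: int, n_frames: int):
        super().__init__()
        self.n_frames = n_frames
        reduced = max(channels // 16, 1)          # assumed bottleneck ratio
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1)
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N*T, C, H, W) -- frames of each clip stacked along the batch axis.
        nt, c, h, w = x.shape
        t = self.n_frames
        feat = self.reduce(x).view(nt // t, t, -1, h, w)
        # Displacement between adjacent frames; pad the last step with zeros.
        diff = feat[:, 1:] - feat[:, :-1]
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)
        attn = torch.sigmoid(self.expand(diff.reshape(nt, -1, h, w)))
        return x * attn  # gate appearance features with motion attention


class DecoupledSpatioTemporal(nn.Module):
    """Replaces 3D convolution with a 2D spatial conv plus a 1D temporal conv."""

    def __init__(self, channels: int, n_frames: int):
        super().__init__()
        self.n_frames = n_frames
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1,
                                  groups=channels)  # depthwise over time

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        t = self.n_frames
        x = self.spatial(x)
        # Fold space into the batch axis so Conv1d runs along the time axis.
        x = x.view(nt // t, t, c, h * w).permute(0, 3, 2, 1).reshape(-1, c, t)
        x = self.temporal(x)
        x = x.view(nt // t, h * w, c, t).permute(0, 3, 2, 1).reshape(nt, c, h, w)
        return x
```

In use, an 8-frame clip would pass through the unchanged 2D ResNet stages with these modules inserted after selected residual blocks, e.g. `MotionExtraction(64, 8)` followed by `DecoupledSpatioTemporal(64, 8)` on a tensor of shape `(batch*8, 64, 56, 56)`, keeping the network end-to-end trainable on RGB frames alone.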
Ranyan NI, Yi ZHANG. Action recognition method based on video spatio-temporal features[J]. Journal of Computer Applications, 2023, 43(2): 521-528.
| Dataset | Method | Year | Accuracy/% |
| --- | --- | --- | --- |
| UCF101 | Ref. [ | 2014 | 88.8 |
| | Ref. [ | 2015 | 82.3 |
| | Ref. [ | 2019 | 95.9 |
| | Ref. [ | 2016 | 94.0 |
| | Ref. [ | 2017 | 92.0 |
| | Ref. [ | 2017 | 88.6 |
| | Ref. [ | 2017 | 93.2 |
| | Ref. [ | 2019 | 93.6 |
| | Ref. [ | 2018 | 94.3 |
| | Ref. [ | 2020 | 95.6 |
| | Ref. [ | 2020 | 94.7 |
| | Ref. [ | 2020 | 93.0 |
| | Ref. [ | 2020 | 94.9 |
| | Ref. [ | 2021 | 95.6 |
| | Proposed | 2022 | 96.5 |
| HMDB51 | Ref. [ | 2014 | 59.4 |
| | Ref. [ | 2015 | 56.8 |
| | Ref. [ | 2019 | 70.7 |
| | Ref. [ | 2016 | 68.5 |
| | Ref. [ | 2018 | 68.3 |
| | Ref. [ | 2017 | 59.2 |
| | Ref. [ | 2019 | 69.4 |
| | Ref. [ | 2020 | 71.5 |
| | Ref. [ | 2020 | 69.7 |
| | Ref. [ | 2020 | 72.1 |
| | Proposed | 2022 | 73.1 |
| Something-Something-V1 | Ref. [ | 2019 | 45.6 |
| | Ref. [ | 2016 | 19.5 |
| | Ref. [ | 2018 | 39.6 |
| | Ref. [ | 2021 | 43.9 |
| | Ref. [ | 2017 | 41.6 |
| | Ref. [ | 2018 | 34.4 |
| | Ref. [ | 2020 | 46.5 |
| | Proposed | 2022 | 46.6 |

Tab. 1 Comparison of different methods on three datasets
| Method | Sampled frames | Parameters/10⁶ | FLOPs/GFLOPs | Accuracy/% |
| --- | --- | --- | --- | --- |
| Ref. [ | 8 | 24.3 | 33 | 45.6 |
| Ref. [ | 8 | 10.7 | 16 | 19.5 |
| Ref. [ | 8 | 47.5 | 32 | 39.6 |
| Ref. [ | 8 | 24.6 | 34 | 43.9 |
| Ref. [ | 32 | 28.0 | 153 | 41.6 |
| Ref. [ | 8 | 18.3 | 33 | 34.4 |
| Ref. [ | 8 | 25.6 | 33 | 46.5 |
| Proposed | 8 | 25.7 | 34 | 46.6 |

Tab. 2 Comparison of sampled frames, parameters, FLOPs and accuracy of different methods on the Something-Something-V1 dataset
| Method | Accuracy/% |
| --- | --- |
| Baseline | 45.6 |
| Baseline + motion information extraction module | 46.0 |
| Baseline + spatio-temporal information extraction module | 45.9 |
| Baseline + motion information extraction module + spatio-temporal information extraction module | 46.6 |

Tab. 3 Influence of different modules on the network
Fig. 6 Confusion matrices of the proposed method on the UCF101, HMDB51 and Something-Something-V1 datasets
[1] WANG H, KLÄSER A, SCHMID C, et al. Dense trajectories and motion boundary descriptors for action recognition[J]. International Journal of Computer Vision, 2013, 103(1): 60-79. DOI: 10.1007/s11263-012-0594-8
[2] WANG P, PANG W H. Two-stream CNN for action recognition based on video segmentation[J]. Journal of Computer Applications, 2019, 39(7): 2081-2086 (in Chinese). DOI: 10.11772/j.issn.1001-9081.2019010156
[3] KLÄSER A, MARSZAŁEK M, SCHMID C. A spatio-temporal descriptor based on 3D-gradients[C]// Proceedings of the 2008 British Machine Vision Conference. Durham: BMVA Press, 2008: No.99. DOI: 10.5244/c.22.99
[4] GUO M X, SONG Q J, XU Z N, et al. Human behavior recognition algorithm based on three-dimensional residual dense network[J]. Journal of Computer Applications, 2019, 39(12): 3482-3489 (in Chinese). DOI: 10.11772/j.issn.1001-9081.2019061056
[5] SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]// Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1. Cambridge: MIT Press, 2014: 568-576.
[6] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 4489-4497. DOI: 10.1109/iccv.2015.510
[7] LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 7082-7092. DOI: 10.1109/iccv.2019.00718
[8] WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9912. Cham: Springer, 2016: 20-36.
[9] LAN Z Z, ZHU Y, HAUPTMANN A G, et al. Deep local video feature for action recognition[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Piscataway: IEEE, 2017: 1219-1225. DOI: 10.1109/cvprw.2017.161
[10] LIN W Y, MI Y, WU J X, et al. Action recognition with coarse-to-fine deep feature integration and asynchronous fusion[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2018: 7130-7137. DOI: 10.1609/aaai.v32i1.12232
[11] JI S W, XU W, YANG M, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221-231. DOI: 10.1109/tpami.2012.59
[12] TRAN D, RAY J, SHOU Z, et al. ConvNet architecture search for spatio-temporal feature learning[EB/OL]. (2017-08-16) [2021-12-26].
[13] CAI J H, HU J G. 3D RANs: 3D residual attention networks for action recognition[J]. The Visual Computer, 2020, 36(6): 1261-1270. DOI: 10.1007/s00371-019-01733-3
[14] ZOLFAGHARI M, SINGH K, BROX T. ECO: efficient convolutional network for online video understanding[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11206. Cham: Springer, 2018: 713-730.
[15] LEE M, LEE S, SON S, et al. Motion feature network: fixed motion filter for action recognition[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11214. Cham: Springer, 2018: 392-408.
[16] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. DOI: 10.1109/cvpr.2016.90
[17] DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]// Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2009: 248-255. DOI: 10.1109/cvpr.2009.5206848
[18] HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7132-7141. DOI: 10.1109/cvpr.2018.00745
[19] SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[EB/OL]. [2021-12-26].
[20] KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: a large video database for human motion recognition[C]// Proceedings of the 2011 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2011: 2556-2563. DOI: 10.1109/iccv.2011.6126543
[21] GOYAL R, KAHOU S E, MICHALSKI V, et al. The "something something" video database for learning and evaluating visual common sense[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5843-5851. DOI: 10.1109/iccv.2017.622
[22] TRAN A, CHEONG L F. Two-stream flow-guided convolutional attention networks for action recognition[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops. Piscataway: IEEE, 2017: 3110-3119. DOI: 10.1109/iccvw.2017.368
[23] QIU Z F, YAO T, MEI T. Learning spatio-temporal representation with pseudo-3D residual networks[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5534-5542. DOI: 10.1109/iccv.2017.590
[24] DIBA A, FAYYAZ M, SHARMA V, et al. Temporal 3D ConvNets: new architecture and transfer learning for video classification[EB/OL]. (2017-11-22) [2021-12-26].
[25] KAZAKOS E, NAGRANI A, ZISSERMAN A, et al. EPIC-fusion: audio-visual temporal binding for egocentric action recognition[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 5491-5500. DOI: 10.1109/iccv.2019.00559
[26] WANG L M, LI W, LI W, et al. Appearance-and-relation networks for video classification[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 1430-1439. DOI: 10.1109/cvpr.2018.00155
[27] LI X Y, SHUAI B, TIGHE J. Directional temporal modeling for action recognition[C]// Proceedings of the 2020 European Conference on Computer Vision, LNCS 12351. Cham: Springer, 2020: 275-291.
[28] KUMAWAT S, VERMA M, NAKASHIMA Y, et al. Depthwise spatio-temporal STFT convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(9): 4839-4851.
[29] SAHOO S P, ARI S, MAHAPATRA K, et al. HAR-Depth: a novel framework for human action recognition using sequential learning and depth estimated history images[J]. IEEE Transactions on Emerging Topics in Computational Intelligence, 2021, 5(5): 813-825. DOI: 10.1109/tetci.2020.3014367
[30] ZHANG J X, HU H F, LIU Z. Appearance-and-dynamic learning with bifurcated convolution neural network for action recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(4): 1593-1606. DOI: 10.1109/tcsvt.2020.3006223
[31] BAI S K, WANG Q, LI X L. MFI: multi-range feature interchange for video action recognition[C]// Proceedings of the 25th International Conference on Pattern Recognition. Piscataway: IEEE, 2021: 6664-6671. DOI: 10.1109/icpr48806.2021.9412124
[32] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 4724-4733. DOI: 10.1109/cvpr.2017.502
[33] ZHOU B L, ANDONIAN A, OLIVA A, et al. Temporal relational reasoning in videos[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11205. Cham: Springer, 2018: 831-846.
[34] LIU Z Y, WANG L M, WU W, et al. TAM: temporal adaptive module for video recognition[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 13688-13698. DOI: 10.1109/iccv48922.2021.01345
[35] SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 618-626. DOI: 10.1109/iccv.2017.74