Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (2): 521-528. DOI: 10.11772/j.issn.1001-9081.2022010017
• Multimedia Computing and Computer Simulation •
Received: 2022-01-07
Revised: 2022-03-18
Accepted: 2022-04-06
Online: 2022-04-21
Published: 2023-02-10
Contact: Yi ZHANG
About author: NI Ranyan, born in 1998 in Huangshan, Anhui, M. S. candidate. Her research interests include computer vision and action recognition.
Abstract: Two-stream networks need to pre-compute optical flow maps to extract motion information, which prevents end-to-end recognition, while 3D convolutional networks suffer from a huge number of parameters. To address these problems, an action recognition method based on video spatio-temporal features was proposed. The method extracts spatio-temporal information from videos efficiently without adding any optical flow computation or 3D convolution operation. First, a motion information extraction module based on the attention mechanism was used to capture the motion displacement information between two adjacent frames, simulating the role of optical flow maps in two-stream networks. Second, a decoupled spatio-temporal information extraction module was proposed to replace 3D convolution and encode the spatio-temporal information. Finally, the two modules were embedded into a 2D residual network to perform end-to-end action recognition. Experimental results on several mainstream action recognition datasets show that, with only RGB video frames as input, the proposed method achieves recognition accuracies of 96.5%, 73.1% and 46.6% on the UCF101, HMDB51 and Something-Something-V1 datasets respectively; compared with the Temporal Segment Network (TSN) method with a two-stream structure, it improves the recognition accuracy on UCF101 by 2.5 percentage points. These results demonstrate that the proposed method can extract spatio-temporal features from videos efficiently.
Ranyan NI, Yi ZHANG. Action recognition method based on video spatio-temporal features[J]. Journal of Computer Applications, 2023, 43(2): 521-528.
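This page gives only the high-level design of the two modules. The sketch below is a minimal PyTorch illustration of the general idea, not the authors' implementation: an attention map built from adjacent-frame feature differences stands in for optical flow, and a spatial 2D convolution followed by a temporal 1D convolution stands in for full 3D convolution. The class names, channel-reduction ratio, depthwise kernels and tensor layout are all illustrative assumptions.

```python
# Minimal sketch of the two module ideas described in the abstract.
# All names and hyperparameters are illustrative assumptions.
# Tensors follow the common (N*T, C, H, W) layout of 2D-CNN video models.
import torch
import torch.nn as nn


class MotionExtraction(nn.Module):
    """Attention-style motion cue from adjacent-frame feature differences
    (a stand-in for optical flow; not the authors' exact module)."""

    def __init__(self, channels: int, n_segment: int = 8, reduction: int = 16):
        super().__init__()
        self.n_segment = n_segment
        self.squeeze = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.expand = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        n = nt // self.n_segment
        feat = self.squeeze(x).view(n, self.n_segment, -1, h, w)
        # Displacement proxy: difference between each frame and its successor;
        # the last frame has no successor, so its difference stays zero.
        diff = torch.zeros_like(feat)
        diff[:, :-1] = feat[:, 1:] - feat[:, :-1]
        attn = torch.sigmoid(self.expand(diff.view(nt, -1, h, w)))
        return x * attn  # reweight appearance features by motion saliency


class DecoupledSpatioTemporal(nn.Module):
    """(2+1)D-style factorization: a depthwise spatial 2D convolution followed
    by a depthwise temporal 1D convolution, instead of full 3D convolution."""

    def __init__(self, channels: int, n_segment: int = 8):
        super().__init__()
        self.n_segment = n_segment
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.temporal = nn.Conv1d(channels, channels, 3, padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nt, c, h, w = x.shape
        n = nt // self.n_segment
        y = self.spatial(x)
        # Fold space into the batch axis and convolve along the time axis.
        y = y.view(n, self.n_segment, c, h * w).permute(0, 3, 2, 1)
        y = y.reshape(n * h * w, c, self.n_segment)
        y = self.temporal(y)
        y = y.view(n, h * w, c, self.n_segment).permute(0, 3, 2, 1)
        return y.reshape(nt, c, h, w) + x  # residual connection


if __name__ == "__main__":
    # Two clips of 8 sampled frames with 64-channel feature maps.
    x = torch.randn(2 * 8, 64, 56, 56)
    x = MotionExtraction(64)(x)
    x = DecoupledSpatioTemporal(64)(x)
    print(x.shape)  # torch.Size([16, 64, 56, 56])
```

Making both convolutions depthwise is one plausible way to keep the cost near a 2D baseline, consistent with the small overhead reported in Tab. 2 (25.7×10⁶ parameters and 34 GFLOPs at 8 sampled frames).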
Tab. 1 Comparison of different methods on three datasets

| Dataset | Method | Year | Accuracy/% |
| --- | --- | --- | --- |
| UCF101 | Ref. [ ] | 2014 | 88.8 |
| | Ref. [ ] | 2015 | 82.3 |
| | Ref. [ ] | 2019 | 95.9 |
| | Ref. [ ] | 2016 | 94.0 |
| | Ref. [ ] | 2017 | 92.0 |
| | Ref. [ ] | 2017 | 88.6 |
| | Ref. [ ] | 2017 | 93.2 |
| | Ref. [ ] | 2019 | 93.6 |
| | Ref. [ ] | 2018 | 94.3 |
| | Ref. [ ] | 2020 | 95.6 |
| | Ref. [ ] | 2020 | 94.7 |
| | Ref. [ ] | 2020 | 93.0 |
| | Ref. [ ] | 2020 | 94.9 |
| | Ref. [ ] | 2021 | 95.6 |
| | Proposed method | 2022 | 96.5 |
| HMDB51 | Ref. [ ] | 2014 | 59.4 |
| | Ref. [ ] | 2015 | 56.8 |
| | Ref. [ ] | 2019 | 70.7 |
| | Ref. [ ] | 2016 | 68.5 |
| | Ref. [ ] | 2018 | 68.3 |
| | Ref. [ ] | 2017 | 59.2 |
| | Ref. [ ] | 2019 | 69.4 |
| | Ref. [ ] | 2020 | 71.5 |
| | Ref. [ ] | 2020 | 69.7 |
| | Ref. [ ] | 2020 | 72.1 |
| | Proposed method | 2022 | 73.1 |
| Something-Something-V1 | Ref. [ ] | 2019 | 45.6 |
| | Ref. [ ] | 2016 | 19.5 |
| | Ref. [ ] | 2018 | 39.6 |
| | Ref. [ ] | 2021 | 43.9 |
| | Ref. [ ] | 2017 | 41.6 |
| | Ref. [ ] | 2018 | 34.4 |
| | Ref. [ ] | 2020 | 46.5 |
| | Proposed method | 2022 | 46.6 |
Tab. 2 Comparison of sampling frames, parameters, FLOPs and accuracy among different methods on Something-Something-V1 dataset

| Method | Sampled frames | Parameters/10⁶ | FLOPs/GFLOPs | Accuracy/% |
| --- | --- | --- | --- | --- |
| Ref. [ ] | 8 | 24.3 | 33 | 45.6 |
| Ref. [ ] | 8 | 10.7 | 16 | 19.5 |
| Ref. [ ] | 8 | 47.5 | 32 | 39.6 |
| Ref. [ ] | 8 | 24.6 | 34 | 43.9 |
| Ref. [ ] | 32 | 28.0 | 153 | 41.6 |
| Ref. [ ] | 8 | 18.3 | 33 | 34.4 |
| Ref. [ ] | 8 | 25.6 | 33 | 46.5 |
| Proposed method | 8 | 25.7 | 34 | 46.6 |
Tab. 3 Influence of different modules on network

| Method | Accuracy/% |
| --- | --- |
| Baseline | 45.6 |
| Baseline + motion information extraction module | 46.0 |
| Baseline + spatio-temporal information extraction module | 45.9 |
| Baseline + motion information extraction module + spatio-temporal information extraction module | 46.6 |
Fig. 6 Confusion matrices of the proposed method on the UCF101, HMDB51 and Something-Something-V1 datasets
1. WANG H, KLÄSER A, SCHMID C, et al. Dense trajectories and motion boundary descriptors for action recognition[J]. International Journal of Computer Vision, 2013, 103(1): 60-79. DOI: 10.1007/s11263-012-0594-8.
2. WANG P, PANG W H. Two-stream CNN for action recognition based on video segmentation[J]. Journal of Computer Applications, 2019, 39(7): 2081-2086. DOI: 10.11772/j.issn.1001-9081.2019010156.
3. KLÄSER A, MARSZAŁEK M, SCHMID C. A spatio-temporal descriptor based on 3D-gradients[C]// Proceedings of the 2008 British Machine Vision Conference. Durham: BMVA Press, 2008: No.99. DOI: 10.5244/c.22.99.
4. GUO M X, SONG Q J, XU Z N, et al. Human behavior recognition algorithm based on three-dimensional residual dense network[J]. Journal of Computer Applications, 2019, 39(12): 3482-3489. DOI: 10.11772/j.issn.1001-9081.2019061056.
5. SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]// Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1. Cambridge: MIT Press, 2014: 568-576.
6. TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 4489-4497. DOI: 10.1109/iccv.2015.510.
7. LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 7082-7092. DOI: 10.1109/iccv.2019.00718.
8. WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]// Proceedings of the 2016 European Conference on Computer Vision, LNCS 9912. Cham: Springer, 2016: 20-36.
9. LAN Z Z, ZHU Y, HAUPTMANN A G, et al. Deep local video feature for action recognition[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Piscataway: IEEE, 2017: 1219-1225. DOI: 10.1109/cvprw.2017.161.
10. LIN W Y, MI Y, WU J X, et al. Action recognition with coarse-to-fine deep feature integration and asynchronous fusion[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press, 2018: 7130-7137. DOI: 10.1609/aaai.v32i1.12232.
11. JI S W, XU W, YANG M, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221-231. DOI: 10.1109/tpami.2012.59.
12. TRAN D, RAY J, SHOU Z, et al. ConvNet architecture search for spatio-temporal feature learning[EB/OL]. (2017-08-16) [2021-12-26].
13. CAI J H, HU J G. 3D RANs: 3D residual attention networks for action recognition[J]. The Visual Computer, 2020, 36(6): 1261-1270. DOI: 10.1007/s00371-019-01733-3.
14. ZOLFAGHARI M, SINGH K, BROX T. ECO: efficient convolutional network for online video understanding[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11206. Cham: Springer, 2018: 713-730.
15. LEE M, LEE S, SON S, et al. Motion feature network: fixed motion filter for action recognition[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11214. Cham: Springer, 2018: 392-408.
16. HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. DOI: 10.1109/cvpr.2016.90.
17. DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]// Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2009: 248-255. DOI: 10.1109/cvpr.2009.5206848.
18. HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7132-7141. DOI: 10.1109/cvpr.2018.00745.
19. SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[EB/OL]. [2021-12-26].
20. KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: a large video database for human motion recognition[C]// Proceedings of the 2011 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2011: 2556-2563. DOI: 10.1109/iccv.2011.6126543.
21. GOYAL R, KAHOU S E, MICHALSKI V, et al. The "something something" video database for learning and evaluating visual common sense[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5843-5851. DOI: 10.1109/iccv.2017.622.
22. TRAN A, CHEONG L F. Two-stream flow-guided convolutional attention networks for action recognition[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops. Piscataway: IEEE, 2017: 3110-3119. DOI: 10.1109/iccvw.2017.368.
23. QIU Z F, YAO T, MEI T. Learning spatio-temporal representation with pseudo-3D residual networks[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5534-5542. DOI: 10.1109/iccv.2017.590.
24. DIBA A, FAYYAZ M, SHARMA V, et al. Temporal 3D ConvNets: new architecture and transfer learning for video classification[EB/OL]. (2017-11-22) [2021-12-26].
25. KAZAKOS E, NAGRANI A, ZISSERMAN A, et al. EPIC-fusion: audio-visual temporal binding for egocentric action recognition[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 5491-5500. DOI: 10.1109/iccv.2019.00559.
26. WANG L M, LI W, LI W, et al. Appearance-and-relation networks for video classification[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 1430-1439. DOI: 10.1109/cvpr.2018.00155.
27. LI X Y, SHUAI B, TIGHE J. Directional temporal modeling for action recognition[C]// Proceedings of the 2020 European Conference on Computer Vision, LNCS 12351. Cham: Springer, 2020: 275-291.
28. KUMAWAT S, VERMA M, NAKASHIMA Y, et al. Depthwise spatio-temporal STFT convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(9): 4839-4851.
29. SAHOO S P, ARI S, MAHAPATRA K, et al. HAR-Depth: a novel framework for human action recognition using sequential learning and depth estimated history images[J]. IEEE Transactions on Emerging Topics in Computational Intelligence, 2021, 5(5): 813-825. DOI: 10.1109/tetci.2020.3014367.
30. ZHANG J X, HU H F, LIU Z. Appearance-and-dynamic learning with bifurcated convolution neural network for action recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(4): 1593-1606. DOI: 10.1109/tcsvt.2020.3006223.
31. BAI S K, WANG Q, LI X L. MFI: multi-range feature interchange for video action recognition[C]// Proceedings of the 25th International Conference on Pattern Recognition. Piscataway: IEEE, 2021: 6664-6671. DOI: 10.1109/icpr48806.2021.9412124.
32. CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the kinetics dataset[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 4724-4733. DOI: 10.1109/cvpr.2017.502.
33. ZHOU B L, ANDONIAN A, OLIVA A, et al. Temporal relational reasoning in videos[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11205. Cham: Springer, 2018: 831-846.
34. LIU Z Y, WANG L M, WU W, et al. TAM: temporal adaptive module for video recognition[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 13688-13698. DOI: 10.1109/iccv48922.2021.01345.
35. SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 618-626. DOI: 10.1109/iccv.2017.74.