[1] ZHU F, SHA L, XIE J, et al. From handcrafted to learned representations for human action recognition:a survey[J]. Image and Vision Computing, 2016, 55:42-52.
[2] YAO G, LEI T, ZHONG J. A review of convolutional-neural-network-based action recognition[J]. Pattern Recognition Letters, 2019, 118:14-22.
[3] 李瑞峰,王亮亮,王珂.人体动作行为识别研究综述[J].模式识别与人工智能,2014,27(1):35-48.(LI R F, WANG L L, WANG K. A survey of human body action recognition[J]. Pattern Recognition and Artificial Intelligence, 2014, 27(1):35-48.)
[4] SCOVANNER P, ALI S, SHAH M. A 3-dimensional SIFT descriptor and its application to action recognition[C]//Proceedings of the 15th ACM International Conference on Multimedia. New York:ACM, 2007:357-360.
[5] LAPTEV I, MARSZALEK M, SCHMID C, et al. Learning realistic human actions from movies[C]//Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2008:1-8.
[6] WANG H, KLÄSER A, SCHMID C, et al. Dense trajectories and motion boundary descriptors for action recognition[J]. International Journal of Computer Vision, 2013, 103(1):60-79.
[7] BOBICK A F, DAVIS J W. The recognition of human movement using temporal templates[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(3):257-267.
[8] 李英杰,尹怡欣,邓飞.一种有效的行为识别视频特征[J].计算机应用,2011,31(2):406-409,419.(LI Y J, YIN Y X, DENG F. Effective video feature for action recognition[J]. Journal of Computer Applications, 2011, 31(2):406-409, 419.)
[9] LIU J, KUIPERS B, SAVARESE S. Recognizing human actions by attributes[C]//Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2011:3337-3344.
[10] WANG H, SCHMID C. Action recognition with improved trajectories[C]//Proceedings of the 2013 IEEE International Conference on Computer Vision. Piscataway:IEEE, 2013:3551-3558.
[11] LAN Z, LIN M, LI X, et al. Beyond Gaussian pyramid:multi-skip feature stacking for action recognition[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2015:204-212.
[12] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. New York:Curran Associates Inc., 2012:1097-1105.
[13] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL].[2019-03-31]. https://arxiv.org/pdf/1409.1556.pdf.
[14] SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2015:1-9.
[15] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2016:770-778.
[16] HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2017:2261-2269.
[17] JI S, XU W, YANG M, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1):221-231.
[18] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway:IEEE, 2015:4489-4497.
[19] TRAN D, RAY J, SHOU Z, et al. ConvNet architecture search for spatiotemporal feature learning[EB/OL].[2019-03-31]. https://arxiv.org/pdf/1708.05038.pdf.
[20] HARA K, KATAOKA H, SATOH Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2018:6546-6555.
[21] TRAN D, WANG H, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2018:6450-6459.
[22] CHEN Y, KALANTIDIS Y, LI J, et al. Multi-fiber networks for video recognition[C]//Proceedings of the 2018 European Conference on Computer Vision, LNCS 11205. Cham:Springer, 2018:364-380.
[23] YANG H, YUAN C, LI B, et al. Asymmetric 3D convolutional neural networks for action recognition[J]. Pattern Recognition, 2019, 85:1-12.
[24] DIBA A, FAYYAZ M, SHARMA V, et al. Spatio-temporal channel correlation networks for action classification[C]//Proceedings of the 2018 European Conference on Computer Vision, LNCS 11208. Cham:Springer, 2018:284-299.
[25] HUSSEIN N, GAVVES E, SMEULDERS A W M. Timeception for complex action recognition[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2019:254-263.
[26] SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]//Proceedings of the 2014 International Conference on Neural Information Processing Systems. New York:Curran Associates Inc., 2014:568-576.
[27] WANG L, XIONG Y, WANG Z, et al. Towards good practices for very deep two-stream ConvNets[EB/OL].[2019-03-31]. https://arxiv.org/pdf/1507.02159.pdf.
[28] WANG L, XIONG Y, WANG Z, et al. Temporal segment networks:towards good practices for deep action recognition[C]//Proceedings of the 2016 European Conference on Computer Vision, LNCS 9912. Cham:Springer, 2016:20-36.
[29] FEICHTENHOFER C, PINZ A, WILDES R P. Spatiotemporal residual networks for video action recognition[C]//Proceedings of the 2016 International Conference on Neural Information Processing Systems. New York:Curran Associates Inc., 2016:3468-3476.
[30] FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2016:1933-1941.
[31] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2017:4724-4733.
[32] MA S, SIGAL L, SCLAROFF S. Learning activity progression in LSTMs for activity detection and early detection[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2016:1942-1950.
[33] DONAHUE J, HENDRICKS L A, GUADARRAMA S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2015:2625-2634.
[34] NG J Y H, HAUSKNECHT M, VIJAYANARASIMHAN S, et al. Beyond short snippets:deep networks for video classification[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway:IEEE, 2015:4694-4702.
[35] WU Z, WANG X, JIANG Y, et al. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification[C]//Proceedings of the 23rd ACM International Conference on Multimedia. New York:ACM, 2015:461-470.
[36] SCHULDT C, LAPTEV I, CAPUTO B. Recognizing human actions:a local SVM approach[C]//Proceedings of the 17th International Conference on Pattern Recognition. Piscataway:IEEE, 2004:32-36.
[37] DOLLAR P, RABAUD V, COTTRELL G, et al. Behavior recognition via sparse spatio-temporal features[C]//Proceedings of the 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance. Piscataway:IEEE, 2005:65-72.
[38] TAYLOR G W, FERGUS R, LECUN Y, et al. Convolutional learning of spatio-temporal features[C]//Proceedings of the 2010 European Conference on Computer Vision, LNCS 6316.
Berlin:Springer, 2010:140-153.