Behavior recognition method based on two-stream non-local residual network

doi:10.11772/j.issn.1001-9081.2020010041

Abstract

Traditional Convolutional Neural Networks (CNNs) extract only local features of human behaviors and actions, which leads to low recognition accuracy for similar behaviors. To resolve this problem, a behavior recognition method based on a two-stream Non-Local Residual Network (NL-ResNet) was proposed. First, the RGB (Red-Green-Blue) frames and the dense optical flow maps of the video were extracted and used as the inputs of the spatial-stream and temporal-stream networks, respectively, and a pre-processing method combining corner cropping with multiple scales was used for data augmentation. Second, the residual blocks of the residual network were used to extract the local appearance features and the motion features of the video, respectively, and the global information of the video was then extracted by a non-local CNN module connected after each residual block, so that the network extracts local and global features in an interleaved manner. Finally, the A-softmax loss function was applied to each of the two branch networks for finer-grained classification, and the recognition result obtained by weighted fusion of the two branches was output. The method makes full use of local and global features to improve the representation capability of the model. On the UCF101 dataset, NL-ResNet achieves a recognition accuracy of 93.5%, which is 5.5 percentage points higher than that of the original two-stream network. Experimental results show that the proposed model extracts behavior features better and effectively improves behavior recognition accuracy.
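As an illustration of two steps named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a simplified non-local operation (dot-product similarity normalized by softmax, with the learned embedding convolutions of the original non-local block omitted) followed by a residual connection, and a late fusion that takes a weighted average of the class scores of the two streams. The function names `non_local` and `fuse` and the fusion weights are hypothetical; the abstract does not state the weights that were used.

```python
import math


def non_local(features):
    """Simplified non-local operation with a residual connection.

    features: list of N position vectors, each a list of C floats.
    Assumption: the learned embeddings (theta, phi, g, W_z) of a real
    non-local block are replaced by the identity to keep the sketch short.
    """
    out = []
    for x_i in features:
        # Similarity of position i with every position j (dot product).
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in features]
        # Softmax over j, shifted by the maximum for numerical stability.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum over all positions: global context for position i.
        y_i = [
            sum(w * x_j[c] for w, x_j in zip(weights, features))
            for c in range(len(x_i))
        ]
        # Residual connection: local feature plus global context.
        out.append([a + b for a, b in zip(x_i, y_i)])
    return out


def fuse(spatial, temporal, w_spatial=1.0, w_temporal=1.5):
    """Weighted average of per-class scores from the two streams.

    The default weights are hypothetical placeholders, not values
    taken from the paper.
    """
    norm = w_spatial + w_temporal
    return [
        (w_spatial * s + w_temporal * t) / norm
        for s, t in zip(spatial, temporal)
    ]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(non_local(feats))
    scores = fuse([0.75, 0.25], [0.25, 0.75])
    print(scores, "predicted class:", scores.index(max(scores)))
```

In this sketch every output position depends on all input positions, which is the sense in which the non-local module adds global information to the local features produced by a residual block.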

Key words: behavior recognition, Two-Stream Convolutional Neural Network (Two-Stream ConvNet), non-local, feature extraction, A-softmax

CLC Number: TP391

ZHOU Yun, CHEN Shurong. Behavior recognition method based on two-stream non-local residual network[J]. Journal of Computer Applications, 2020, 40(8): 2236-2240.

周云, 陈淑荣. 基于双流非局部残差网络的行为识别方法[J]. 计算机应用, 2020, 40(8): 2236-2240.
