Abstract: The traditional Convolutional Neural Network (CNN) extracts only local features of human behaviors and actions, which leads to low recognition accuracy for similar behaviors. To address this problem, a two-stream behavior recognition method based on a Non-Local Residual Network (NL-ResNet) was proposed. First, the RGB (Red-Green-Blue) frames and dense optical flow maps of the video were extracted and used as the inputs of the spatial and temporal stream networks, respectively, and a pre-processing method combining corner cropping and multi-scale sampling was applied for data augmentation. Second, the residual blocks of the residual network were used to extract the local appearance features and motion features of the video, respectively; the global information of the video was then extracted by the non-local CNN module connected after each residual block, achieving interleaved extraction of the network's local and global features. Finally, the two branch networks were classified more accurately by the A-softmax loss function, and the recognition results were output after weighted fusion. The method makes full use of both global and local features to improve the representation capability of the model. On the UCF101 dataset, NL-ResNet achieves a recognition accuracy of 93.5%, which is 5.5 percentage points higher than that of the original two-stream network. Experimental results show that the proposed model extracts behavior features better and effectively improves behavior recognition accuracy.
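The core of the method above is the non-local module appended after a residual block: each position aggregates information from all other positions, giving the global context that plain convolutions miss, while a residual connection preserves the local features already extracted. Below is a minimal NumPy sketch of the embedded-Gaussian non-local operation in this residual form; the linear maps `theta`, `phi`, `g`, `w` stand in for the 1x1 convolutions of the actual network, and their random initialization is an illustrative assumption, not the paper's trained weights.

```python
import numpy as np

def nonlocal_block(x, seed=0):
    """Simplified embedded-Gaussian non-local operation with a residual
    connection. x: (N, C) array of N flattened space-time positions with
    C channels; returns an array of the same shape, x plus a global
    attention term, so the block can follow a residual block unchanged."""
    rng = np.random.default_rng(seed)
    n, c = x.shape
    # theta, phi, g, w: stand-ins for the module's 1x1 convolutions,
    # initialized small so the residual term starts near identity.
    theta, phi, g, w = (rng.standard_normal((c, c)) * 0.01 for _ in range(4))
    # Pairwise similarity of every position i with every position j.
    f = (x @ theta) @ (x @ phi).T                    # (N, N)
    f = np.exp(f - f.max(axis=1, keepdims=True))     # stabilized exp
    attn = f / f.sum(axis=1, keepdims=True)          # softmax over j
    y = attn @ (x @ g)                               # aggregate global context
    return x + y @ w                                 # residual connection

features = np.ones((4, 3))          # toy stand-in for residual-block output
out = nonlocal_block(features)      # same shape, globally contextualized
```

Because the output keeps the input's shape and reduces to the identity when `w` is zero, the module can be inserted between existing residual blocks without disturbing the pretrained local-feature pathway, which is what enables the crossover of local and global feature extraction described in the abstract.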