To address the low accuracy of the original spatial-temporal two-stream Convolutional Neural Network (CNN) model for action recognition in long and complex videos, a two-stream CNN for action recognition based on video segmentation was proposed. Firstly, a video was split into multiple non-overlapping segments of equal length. For each segment, one frame was randomly sampled to represent its static features, and stacked optical flow images were computed to represent its motion features. Secondly, these two kinds of images were fed into the spatial CNN and the temporal CNN respectively for feature extraction, and the video-level classification prediction features of the spatial and temporal domains were obtained by merging the segment features within each stream. Finally, the two-stream prediction features were fused to obtain the action recognition result for the video. In a series of experiments, data augmentation techniques and transfer learning methods were explored to alleviate the over-fitting caused by the lack of training samples, and the effects of the number of segments, the network architecture, the segment-based feature fusion scheme and the two-stream integration strategy on recognition performance were analyzed. The experimental results show that the proposed model achieves an accuracy of 91.80% on the UCF101 dataset, 3.8 percentage points higher than the original two-stream CNN model, and improves the accuracy on the HMDB51 dataset to 61.39%, also exceeding the original model. These results indicate that the proposed model can better learn and represent action features in long and complex videos.
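The segment-then-fuse pipeline described above can be summarized in a minimal Python/PyTorch sketch. This is an illustration, not the paper's implementation: the tiny make_cnn backbone, the segment count K, the flow-stack depth L, the mean segment consensus, and the fusion weights w_spatial / w_temporal are all assumptions standing in for the paper's pretrained networks and tuned settings.

import torch
import torch.nn as nn

NUM_CLASSES = 101  # e.g., UCF101
K = 3              # number of non-overlapping segments (a tunable factor studied in the paper)
L = 10             # optical-flow images stacked per segment (assumed value for illustration)

def make_cnn(in_channels: int) -> nn.Module:
    """Stand-in backbone; the paper uses deeper ImageNet-pretrained CNNs."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, kernel_size=7, stride=2, padding=3),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(16, NUM_CLASSES),
    )

spatial_cnn = make_cnn(in_channels=3)        # one RGB frame per segment (static features)
temporal_cnn = make_cnn(in_channels=2 * L)   # stacked x/y flow maps per segment (motion features)

def segment_consensus(scores: torch.Tensor) -> torch.Tensor:
    """Merge per-segment class scores into one video-level prediction (mean consensus, assumed)."""
    return scores.mean(dim=0)

def predict(rgb_frames: torch.Tensor, flow_stacks: torch.Tensor,
            w_spatial: float = 1.0, w_temporal: float = 1.5) -> int:
    """rgb_frames: (K, 3, H, W); flow_stacks: (K, 2L, H, W). Fusion weights are illustrative."""
    spatial = segment_consensus(spatial_cnn(rgb_frames))     # (NUM_CLASSES,)
    temporal = segment_consensus(temporal_cnn(flow_stacks))  # (NUM_CLASSES,)
    fused = w_spatial * spatial + w_temporal * temporal      # weighted two-stream integration
    return fused.argmax().item()

# Toy usage with random tensors standing in for real video data.
rgb = torch.randn(K, 3, 224, 224)
flow = torch.randn(K, 2 * L, 224, 224)
print(predict(rgb, flow))

Averaging the segment scores before fusing the two streams lets the video-level prediction aggregate evidence from the whole duration of the video, which is the intuition behind segment-based modeling of long, complex actions.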
WANG Ping, PANG Wenhao. Two-stream CNN for action recognition based on video segmentation. Journal of Computer Applications, 2019, 39(7): 2081-2086.