To address the low accuracy of the original spatial-temporal two-stream Convolutional Neural Network (CNN) model for action recognition in long and complex videos, a two-stream CNN for action recognition based on video segmentation was proposed. Firstly, a video was split into multiple non-overlapping segments of equal length. For each segment, one frame was randomly sampled to represent its static features, and stacked optical flow images were computed to represent its motion features. Secondly, these two kinds of images were fed into the spatial CNN and the temporal CNN respectively for feature extraction, and the video-level classification prediction features of the spatial and temporal domains were obtained by merging the segment features within each stream. Finally, the two-stream prediction features were fused to obtain the action recognition result for the video. In a series of experiments, data augmentation techniques and transfer learning methods were explored to alleviate the over-fitting caused by the lack of training samples, and the effects of the number of segments, the network architecture, the segment-based feature fusion scheme and the two-stream integration strategy on recognition performance were analyzed. The experimental results show that the proposed model achieves an accuracy of 91.80% on the UCF101 dataset, 3.8 percentage points higher than the original two-stream CNN model, and improves the accuracy on the HMDB51 dataset to 61.39%, also exceeding the original model. These results indicate that the proposed model can better learn and represent action features in long and complex videos.
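The segment-then-fuse pipeline described above can be summarized in a minimal Python/PyTorch sketch. This is an illustration, not the paper's implementation: the tiny make_cnn backbone, the segment count K, the flow-stack depth L, the mean segment consensus, and the fusion weights w_spatial / w_temporal are all assumptions standing in for the paper's pretrained networks and tuned settings.

import torch
import torch.nn as nn

NUM_CLASSES = 101  # e.g., UCF101
K = 3              # number of non-overlapping segments (a tunable factor studied in the paper)
L = 10             # optical-flow images stacked per segment (assumed value for illustration)

def make_cnn(in_channels: int) -> nn.Module:
    """Stand-in backbone; the paper uses deeper ImageNet-pretrained CNNs."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, kernel_size=7, stride=2, padding=3),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(16, NUM_CLASSES),
    )

spatial_cnn = make_cnn(in_channels=3)        # one RGB frame per segment (static features)
temporal_cnn = make_cnn(in_channels=2 * L)   # stacked x/y flow maps per segment (motion features)

def segment_consensus(scores: torch.Tensor) -> torch.Tensor:
    """Merge per-segment class scores into one video-level prediction (mean consensus, assumed)."""
    return scores.mean(dim=0)

def predict(rgb_frames: torch.Tensor, flow_stacks: torch.Tensor,
            w_spatial: float = 1.0, w_temporal: float = 1.5) -> int:
    """rgb_frames: (K, 3, H, W); flow_stacks: (K, 2L, H, W). Fusion weights are illustrative."""
    spatial = segment_consensus(spatial_cnn(rgb_frames))     # (NUM_CLASSES,)
    temporal = segment_consensus(temporal_cnn(flow_stacks))  # (NUM_CLASSES,)
    fused = w_spatial * spatial + w_temporal * temporal      # weighted two-stream integration
    return fused.argmax().item()

# Toy usage with random tensors standing in for real video data.
rgb = torch.randn(K, 3, 224, 224)
flow = torch.randn(K, 2 * L, 224, 224)
print(predict(rgb, flow))

Averaging the segment scores before fusing the two streams lets the video-level prediction aggregate evidence from the whole duration of the video, which is the intuition behind segment-based modeling of long, complex actions.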
WANG Ping, PANG Wenhao. Two-stream CNN for action recognition based on video segmentation. Journal of Computer Applications, 2019, 39(7): 2081-2086.