Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (11): 3349-3354.DOI: 10.11772/j.issn.1001-9081.2019040633

• Virtual reality and multimedia computing •

Human interaction recognition based on RGB and skeleton data fusion model

JI Xiaofei, QIN Linlin, WANG Yangyang   

  1. College of Automation, Shenyang Aerospace University, Shenyang Liaoning 110136, China
  • Received: 2019-04-15  Revised: 2019-07-26  Online: 2019-11-10  Published: 2019-08-26
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61602321), the Local Project of Scientific Research Service of the Liaoning Education Department (L201708), and the Scientific Research Youth Project of the Liaoning Education Department (L201745).

  • Corresponding author: JI Xiaofei
  • About the authors: JI Xiaofei (1978-), female, born in Anshan, Liaoning, Ph. D., associate professor, research interests: video analysis and processing, pattern recognition; QIN Linlin (1994-), female, born in Heze, Shandong, M. S. candidate, research interests: video analysis and processing, biometric and behavior analysis; WANG Yangyang (1979-), female, born in Shenyang, Liaoning, Ph. D., engineer, research interests: video analysis and processing.

Abstract: In recent years, significant progress has been made in human interaction recognition based on RGB video sequences. However, because RGB video lacks depth information, it cannot produce accurate recognition results for complex interactions. Depth sensors (such as the Microsoft Kinect) can effectively improve the tracking accuracy of whole-body joint points and provide three-dimensional joint data that accurately capture human motion and its changes. According to the respective characteristics of RGB and joint point data, a Convolutional Neural Network (CNN) model based on the fusion of two information streams, RGB and joint point data, was proposed. Firstly, the region of interest of the RGB video in the time domain was obtained with the ViBe algorithm, key frames were extracted and mapped to RGB space to obtain a spatial-temporal map representing the video information, and the map was fed into a CNN to extract features. Then, vectors were constructed in each frame of the joint point sequence to extract the Cosine Distance (CD) and Normalized Magnitude (NM) features; the per-frame CD and NM features were concatenated in the temporal order of the joint point sequence and fed into a CNN to learn higher-level temporal features. Finally, the softmax recognition probability matrices of the two information sources were fused to obtain the final recognition result. The experimental results show that combining RGB video information with joint point information effectively improves the recognition of human interaction, achieving recognition rates of 92.55% and 80.09% on the public SBU Kinect interaction database and the NTU RGB+D database respectively, which verifies the effectiveness of the proposed model for human interaction recognition.
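The abstract leaves several implementation details open, in particular which joint pairs form the per-frame vectors and how the two streams' softmax outputs are weighted. The sketch below is a minimal illustration under stated assumptions (corresponding joints of the two persons are paired, and the fusion uses an equally weighted sum); function names such as skeleton_frame_features and fuse_softmax are hypothetical, not from the paper.

```python
import numpy as np

def skeleton_frame_features(joints_a, joints_b):
    """Build inter-person joint vectors for one frame and return the
    Cosine Distance (CD) and Normalized Magnitude (NM) features.

    joints_a, joints_b: (J, 3) arrays of 3D joint coordinates of the two
    interacting people. Pairing corresponding joints is an assumption;
    the paper only states that vectors are constructed per frame.
    """
    vectors = joints_b - joints_a                        # (J, 3)
    magnitudes = np.linalg.norm(vectors, axis=1) + 1e-8  # (J,)

    # CD: pairwise cosine similarity between the joint vectors.
    unit = vectors / magnitudes[:, None]
    cd = unit @ unit.T                                   # (J, J)

    # NM: magnitudes normalized within the frame.
    nm = magnitudes / magnitudes.max()                   # (J,)

    return np.concatenate([cd.ravel(), nm])

def build_skeleton_map(sequence_a, sequence_b):
    """Concatenate per-frame CD/NM features in temporal order, giving a
    2D map (frames x features) that can be fed to the skeleton-stream CNN."""
    return np.stack([skeleton_frame_features(fa, fb)
                     for fa, fb in zip(sequence_a, sequence_b)])

def fuse_softmax(prob_rgb, prob_skeleton, weight=0.5):
    """Late fusion of the two streams' softmax probability matrices
    (shape: samples x classes). Equal weighting is an assumption; the
    abstract does not specify the fusion weights."""
    fused = weight * prob_rgb + (1.0 - weight) * prob_skeleton
    return fused.argmax(axis=1)
```

As a usage example, build_skeleton_map would be applied to the two persons' joint sequences of one video to produce the skeleton-stream input, and fuse_softmax would combine that stream's class probabilities with those of the RGB spatial-temporal-map stream to yield the final label.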

Key words: RGB video, skeleton data, Convolutional Neural Network (CNN), softmax, fusion, human interaction recognition
