Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (11): 3349-3354. DOI: 10.11772/j.issn.1001-9081.2019040633

• Virtual Reality and Multimedia Computing •

Human interaction recognition based on RGB and skeleton data fusion model

JI Xiaofei, QIN Linlin, WANG Yangyang

  1. College of Automation, Shenyang Aerospace University, Shenyang Liaoning 110136, China
  • Received: 2019-04-15  Revised: 2019-07-26  Online: 2019-11-10  Published: 2019-08-26
  • Corresponding author: JI Xiaofei
  • About the authors: JI Xiaofei (1978-), female, born in Anshan, Liaoning, Ph. D., associate professor; her research interests include video analysis and processing, and pattern recognition. QIN Linlin (1994-), female, born in Heze, Shandong, M. S. candidate; her research interests include video analysis and processing, and biometric and behavior analysis. WANG Yangyang (1979-), female, born in Shenyang, Liaoning, Ph. D., engineer; her research interests include video analysis and processing.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61602321), the Local Project of Scientific Research Service of the Liaoning Provincial Department of Education (L201708), and the Scientific Research Youth Project of the Liaoning Provincial Department of Education (L201745).

Abstract: In recent years, significant progress has been made in human interaction recognition based on RGB video sequences. However, because RGB data lacks depth information, complex interactions cannot be recognized accurately. Depth sensors (such as Microsoft Kinect) can effectively improve the tracking accuracy of whole-body joint points and provide three-dimensional joint point data that accurately describes human motion and its changes. According to the respective characteristics of RGB video and joint point data, a Convolutional Neural Network (CNN) model based on dual-stream fusion of RGB and joint point data was proposed. Firstly, the region of interest of the RGB video in the time domain was obtained by the Vibe algorithm, and the key frames were extracted and mapped to RGB space to obtain a spatial-temporal map representing the video information; the map was then fed into a CNN to extract features. Then, vectors were constructed in each frame of the joint point sequence to extract the Cosine Distance (CD) and Normalized Magnitude (NM) features; the per-frame cosine distance and joint point features were concatenated in the temporal order of the joint point sequence and fed into a CNN to learn higher-level temporal features. Finally, the softmax recognition probability matrices of the two information sources were fused to obtain the final recognition result. Experimental results show that combining RGB video information with joint point information effectively improves the recognition of human interactions: the proposed model achieves recognition rates of 92.55% and 80.09% on the public SBU Kinect interaction database and the NTU RGB+D database respectively, verifying its effectiveness for two-person interaction recognition.
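
The following Python sketch is illustrative only and is not code from the paper. Under simplifying assumptions, it shows how per-frame Cosine Distance (CD) and Normalized Magnitude (NM) descriptors could be computed from inter-person joint vectors, and how the two streams' softmax probability matrices could be fused by a weighted sum. The reference direction used for CD and the fusion weight alpha are assumptions of this sketch, not definitions from the paper.

```python
import numpy as np

def cd_nm_features(joints_a, joints_b):
    # joints_a, joints_b: (J, 3) arrays of 3D joint coordinates for the two persons in one frame
    vectors = joints_b[None, :, :] - joints_a[:, None, :]      # (J, J, 3): vectors from each joint of A to each joint of B
    norms = np.linalg.norm(vectors, axis=-1)                   # (J, J): vector magnitudes
    nm = norms / (norms.max() + 1e-8)                          # NM: magnitudes normalized within the frame
    ref = joints_b.mean(axis=0) - joints_a.mean(axis=0)        # hypothetical reference direction (body center to body center)
    cos_sim = (vectors @ ref) / (norms * np.linalg.norm(ref) + 1e-8)
    cd = 1.0 - cos_sim                                         # CD: cosine distance to the reference direction
    return cd.ravel(), nm.ravel()                              # per-frame feature vectors, stacked frame by frame downstream

def fuse_softmax(prob_rgb, prob_skeleton, alpha=0.5):
    # prob_rgb, prob_skeleton: (N, C) softmax probability matrices from the two CNN streams
    fused = alpha * prob_rgb + (1.0 - alpha) * prob_skeleton   # weighted late fusion of the two streams
    return fused.argmax(axis=1)                                # final predicted class per sample
```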

Key words: RGB video, skeleton data, Convolutional Neural Network (CNN), softmax, fusion, human interaction recognition

CLC Number: