Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (6): 1657-1662.DOI: 10.11772/j.issn.1001-9081.2018122551

• Artificial intelligence •

Video frame prediction based on deep convolutional long short-term memory neural network

ZHANG Dezheng, WENG Liguo, XIA Min, CAO Hui   

  1. Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology(Nanjing University of Information Science & Technology), Nanjing Jiangsu 210044, China
  • Received: 2018-12-26  Revised: 2019-03-17  Online: 2019-06-10  Published: 2019-06-17
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61503192, 61773219), the Natural Science Foundation of Jiangsu Province (BK20161533), and the Qing Lan Project of Jiangsu Province.

  • Corresponding author: XIA Min
  • About the authors: ZHANG Dezheng (1995-), male, born in Siyang, Jiangsu, M.S. candidate, research interests: machine learning, big data analysis; WENG Liguo (1981-), male, born in Nanjing, Jiangsu, associate professor, Ph.D., research interests: machine learning, big data analysis; XIA Min (1983-), male, born in Dongtai, Jiangsu, associate professor, Ph.D., research interests: machine learning, big data analysis; CAO Hui (1993-), male, born in Huai'an, Jiangsu, M.S. candidate, research interests: machine learning, big data analysis.

Abstract: To address the difficulty of accurately predicting fine spatial-structure details in video frame prediction, a deep convolutional Long Short-Term Memory (LSTM) neural network was proposed as an improvement of the convolutional LSTM network. Firstly, the input image sequence was fed into an encoding network composed of two deep convolutional LSTM channels, which learned the changes in both the position information and the spatial-structure information of the sequence. Then, the learned change features were fed into the decoding network corresponding to each encoding channel, and the decoding network output the next predicted frame. Finally, the predicted frame was fed back into the decoding network to predict the following frame, and this loop was repeated a preset number of times before all predicted frames were output. In experiments on the Moving-MNIST dataset, compared with the convolutional LSTM network under the same number of training steps, the proposed method retained accurate position prediction while representing spatial-structure details more faithfully. Moreover, when the convolutional layers of the convolutional Gated Recurrent Unit (GRU) network were deepened in the same way, the representation of spatial-structure details also improved, verifying the generality of the proposed idea.
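The encode-then-recursively-decode scheme described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it uses a single encoder channel rather than the paper's two, a single ConvLSTM layer per network rather than a deep stack, and random untrained weights; the names (`ConvLSTMCell`, `conv2d_same`) and all sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_same(x, w):
    """Naive 'same' 3x3 convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    c_out, c_in, kh, kw = w.shape
    H, W = x.shape[1], x.shape[2]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))          # zero-pad spatial dims
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(c_in):
            for dh in range(kh):
                for dw in range(kw):
                    out[o] += w[o, i, dh, dw] * xp[i, dh:dh + H, dw:dw + W]
    return out

class ConvLSTMCell:
    """Convolutional LSTM: gates are computed by convolution instead of
    matrix multiplication, so hidden states keep their spatial layout."""
    def __init__(self, in_ch, hid_ch, rng):
        # one weight tensor produces all four gates (i, f, g, o) at once
        self.w = rng.standard_normal((4 * hid_ch, in_ch + hid_ch, 3, 3)) * 0.1
        self.b = np.zeros((4 * hid_ch, 1, 1))

    def step(self, x, h, c):
        z = conv2d_same(np.concatenate([x, h], axis=0), self.w) + self.b
        i, f, g, o = np.split(z, 4, axis=0)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new

rng = np.random.default_rng(0)
enc = ConvLSTMCell(1, 8, rng)                  # encoding network (one channel here)
dec = ConvLSTMCell(1, 8, rng)                  # decoding network
readout = rng.standard_normal((1, 8, 3, 3)) * 0.1   # maps hidden state -> frame

frames = [rng.random((1, 16, 16)) for _ in range(5)]  # input sequence
h = c = np.zeros((8, 16, 16))
for x in frames:                               # encoding phase: learn change features
    h, c = enc.step(x, h, c)

pred = sigmoid(conv2d_same(h, readout))        # first predicted frame
preds = [pred]
for _ in range(2):                             # recursive decoding: feed prediction back
    h, c = dec.step(pred, h, c)
    pred = sigmoid(conv2d_same(h, readout))
    preds.append(pred)
```

The key design point the abstract relies on is the feedback loop at the end: each predicted frame becomes the decoder's next input, so a preset number of loop iterations yields that many future frames from one encoded sequence.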

Key words: video frame prediction, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM) neural network, encoding prediction, convolutional Gated Recurrent Unit (GRU)

