Journal of Computer Applications (《计算机应用》), 2024, Vol. 44, Issue (1): 113-122. DOI: 10.11772/j.issn.1001-9081.2023060853

• Artificial Intelligence •

结合内卷与卷积算子的视频预测模型

朱俊宏1, 赖俊宇1,2, 甘炼强1, 陈智勇1, 刘华烁1, 徐国尧1

  1. 电子科技大学 航空航天学院, 成都 611731
  2. 飞行器集群智能感知与协同控制四川省重点实验室(电子科技大学), 成都 611731
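  • Received: 2023-06-30 Revised: 2023-10-10 Accepted: 2023-10-13 Online: 2024-01-24 Published: 2024-01-10
  • Corresponding author: 赖俊宇 (Junyu LAI)
  • About the authors: 朱俊宏 (ZHU Junhong), born in 1998, male, native of Deyang, Sichuan, M. S. candidate. Research interests: computer vision, video prediction.
    甘炼强 (GAN Lianqiang), born in 2000, male, native of Guang'an, Sichuan, M. S. candidate. Research interests: computer vision, video prediction.
    陈智勇 (CHEN Zhiyong), born in 1997, male, native of Yueyang, Hunan, M. S. candidate. Research interests: machine learning, deep learning.
    刘华烁 (LIU Huashuo), born in 2000, male, native of Hengshui, Hebei, M. S. candidate. Research interests: deep learning, reinforcement learning.
    徐国尧 (XU Guoyao), born in 2000, male, native of Huanggang, Hubei, M. S. candidate. Research interests: deep learning, reinforcement learning.
    Primary contact: 赖俊宇 (LAI Junyu), born in 1981, male, native of Deyang, Sichuan, associate professor, Ph. D. Research interests: computer vision, computer networks.
  • Supported by:
    Key Research and Development Project of Sichuan Province (2022YFS0546)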

Video prediction model combining involution and convolution operators

Junhong ZHU1, Junyu LAI1,2, Lianqiang GAN1, Zhiyong CHEN1, Huashuo LIU1, Guoyao XU1

  1. School of Aeronautics and Astronautics, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China
  2. Aircraft Swarm Intelligent Sensing and Cooperative Control Key Laboratory of Sichuan Province (University of Electronic Science and Technology of China), Chengdu, Sichuan 611731, China
  • Received: 2023-06-30 Revised: 2023-10-10 Accepted: 2023-10-13 Online: 2024-01-24 Published: 2024-01-10
  • Contact: Junyu LAI
  • About the authors: ZHU Junhong, born in 1998, M. S. candidate. His research interests include computer vision and video prediction.
    GAN Lianqiang, born in 2000, M. S. candidate. His research interests include computer vision and video prediction.
    CHEN Zhiyong, born in 1997, M. S. candidate. His research interests include machine learning and deep learning.
    LIU Huashuo, born in 2000, M. S. candidate. His research interests include deep learning and reinforcement learning.
    XU Guoyao, born in 2000, M. S. candidate. His research interests include deep learning and reinforcement learning.
  • Supported by:
    Key Research and Development Project of Sichuan Province (2022YFS0546)

Abstract:

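To address the poor extraction of spatial features from data and the low prediction accuracy of video prediction based on traditional deep learning, a video prediction model Combining Involution and Convolution Operators (CICO) is proposed. The model improves the prediction of video sequences in three main ways. First, convolution kernels of different sizes are used to strengthen the extraction of multi-granularity spatial features: larger kernels extract features over a wider spatial range, while smaller kernels capture the motion details of video targets more precisely, so that targets are represented from multiple angles. Second, convolution operators with larger kernels are replaced by involution operators, which are more computationally efficient and have fewer parameters; through efficient inter-channel interaction, involution avoids a large number of unnecessary parameters, reducing computational and storage costs while improving the model's predictive ability. Finally, convolutions with 1×1 kernels are introduced for linear mapping, which strengthens the joint expression of different features, improves the utilization efficiency of model parameters, and makes predictions more robust. The model was tested comprehensively on multiple datasets. The results show that, compared with the current best-performing SimVP (Simpler yet better Video Prediction) model, the proposed model improves significantly on several metrics. On the Moving MNIST dataset, the mean squared error and mean absolute error are reduced by 25.2% and 17.4%, respectively; on the Traffic Beijing dataset, the mean squared error is reduced by 1.2%; on the KTH human action dataset, the structural similarity index and peak signal-to-noise ratio are improved by 0.66% and 0.47%, respectively. The proposed model is therefore effective in improving the accuracy of video prediction.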
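The second point of the abstract, replacing large-kernel convolutions with involution, can be illustrated with a minimal pure-Python sketch. It is not the CICO implementation: the function name `involution2d`, the single linear kernel-generation matrix `weight`, and the toy sizes are assumptions made for illustration. The idea it shows is that each pixel's K×K kernel is generated from that pixel's own feature vector across all channels and then shared by every channel in a group, so the parameter count does not grow with the square of the channel count.

```python
def involution2d(x, weight, kernel_size, groups):
    """Simplified involution over x, a C x H x W nested list (zero padding).

    weight is a (groups * K * K) x C matrix: a hypothetical single linear
    layer mapping the C-dimensional feature at a pixel to one K x K kernel
    per channel group for that pixel.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K, pad = kernel_size, kernel_size // 2
    ch_per_group = C // groups
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for i in range(H):
        for j in range(W):
            # Feature vector at (i, j) across channels drives kernel generation.
            feat = [x[c][i][j] for c in range(C)]
            kern = [sum(w * f for w, f in zip(row, feat)) for row in weight]
            for c in range(C):
                g = c // ch_per_group  # channels in a group share one kernel
                acc = 0.0
                for u in range(K):
                    for v in range(K):
                        ii, jj = i + u - pad, j + v - pad
                        if 0 <= ii < H and 0 <= jj < W:
                            acc += kern[(g * K + u) * K + v] * x[c][ii][jj]
                out[c][i][j] = acc
    return out


# Toy input (illustrative values only): all-ones features and weights.
C, H, W, K, G = 4, 5, 5, 3, 2
x = [[[1.0] * W for _ in range(H)] for _ in range(C)]
weight = [[1.0] * C for _ in range(G * K * K)]
y = involution2d(x, weight, K, G)

# Parameter counts for this toy setting (bias terms ignored).
conv_params = C * C * K * K  # standard C-to-C convolution: 144
inv_params = G * K * K * C   # this simplified kernel generator: 72
print(y[0][2][2], y[0][0][0], conv_params, inv_params)
```

In this toy setting the convolution needs C·C·K·K weights while the simplified kernel generator needs G·K·K·C, which is the kind of saving the abstract attributes to involution; the exact figures for the published model depend on design details not given here.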

Key words: deep learning, video prediction, involution, convolution, spatiotemporal information

Abstract:

To address inadequate spatial feature extraction and low prediction accuracy in traditional deep learning based video prediction, a video prediction model Combining Involution and Convolution Operators (CICO) was proposed. The model improved video prediction performance in three ways. Firstly, convolutions with varying kernel sizes were adopted to enhance the extraction of multi-granularity spatial features and enable multi-angle representation learning of targets: larger kernels extracted features from broader spatial ranges, while smaller kernels captured motion details more precisely. Secondly, large-kernel convolutions were replaced by involution operators, which are computationally more efficient and have fewer parameters; their efficient inter-channel interaction avoided redundant parameters and decreased computational and storage costs while enhancing the predictive capacity of the model. Finally, convolutions with kernel size 1×1 were introduced for linear mapping to strengthen the joint expression of distinct features, improve parameter utilization efficiency, and strengthen prediction robustness. Comprehensive experiments on several datasets showed significant improvements over the state-of-the-art SimVP (Simpler yet Better Video Prediction) model. On the Moving MNIST dataset, the Mean Squared Error (MSE) and Mean Absolute Error (MAE) were reduced by 25.2% and 17.4%, respectively. On the Traffic Beijing dataset, the MSE was reduced by 1.2%. On the KTH dataset, the Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR) were improved by 0.66% and 0.47%, respectively. These results indicate that the proposed model is effective in improving the accuracy of video prediction.
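The gains quoted in the abstract are relative changes in standard frame-level metrics. As a rough sketch of how such figures are typically computed, the snippet below defines MSE, MAE, PSNR, and a relative-reduction helper on flattened pixel lists. The function names and all numeric values are illustrative assumptions, not the paper's evaluation code or results, and the per-pixel averaging used here is only one convention (some video prediction benchmarks sum errors over each frame instead).

```python
import math


def mse(pred, target):
    """Mean squared error over paired pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae(pred, target):
    """Mean absolute error over paired pixel values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for a perfect prediction."""
    err = mse(pred, target)
    return math.inf if err == 0 else 10.0 * math.log10(max_value ** 2 / err)


def relative_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline


# Illustrative pixel values in [0, 1]; not data from the paper.
target = [0.0, 0.5, 1.0, 0.5]
pred = [0.0, 0.5, 0.5, 0.5]
print(mse(pred, target))   # 0.0625
print(mae(pred, target))   # 0.125
print(round(psnr(pred, target), 2))
# A baseline error of 40.0 lowered to 30.0 is a 25.0% reduction.
print(relative_reduction(40.0, 30.0))
```

SSIM is omitted because it involves local windowed statistics; in practice it is usually taken from an image-processing library rather than written by hand.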

Key words: deep learning, video prediction, involution, convolution, spatiotemporal information

CLC Number: