Journal of Computer Applications, 2021, Vol. 41, Issue (4): 1012-1019. DOI: 10.11772/j.issn.1001-9081.2020081292

Attention fusion network based video super-resolution reconstruction

BIAN Pengcheng, ZHENG Zhonglong, LI Minglu, HE Yiran, WANG Tianxiang, ZHANG Dawei, CHEN Liyuan   

  1. College of Mathematics and Computer Science, Zhejiang Normal University, Jinhua, Zhejiang 321004, China
  • Received: 2020-08-24 Revised: 2020-09-18 Online: 2021-04-10 Published: 2020-11-05
    This work is partially supported by the National Natural Science Foundation of China (61672467) and the Zhejiang Provincial Natural Science Foundation (LGG18F020017).


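  • About the authors: BIAN Pengcheng, born in 1993 in Lu'an, Anhui, male, M.S. candidate. His research interests include deep learning and computer vision. ZHENG Zhonglong, born in 1976 in Cangzhou, Hebei, male, Ph.D., professor, CCF member. His research interests include pattern recognition, machine learning and image processing. LI Minglu, born in 1965 in Chongqing, male, Ph.D., professor, CCF member. His research interests include cloud computing, vehicular ad hoc networks, wireless sensor networks and big data analysis. HE Yiran, born in 1996 in Hangzhou, Zhejiang, female, M.S. Her research interests include machine learning. WANG Tianxiang, born in 1994 in Jinhua, Zhejiang, male, Ph.D. candidate. His research interests include machine learning and computer vision. ZHANG Dawei, born in 1995 in Suqian, Jiangsu, male, Ph.D. candidate. His research interests include deep learning and computer vision. CHEN Liyuan, born in 1994 in Jiaozuo, Henan, female, Ph.D. candidate. Her research interests include deep learning and computer vision.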
Abstract: Video super-resolution methods based on deep learning mainly focus on the intra-frame and inter-frame spatio-temporal relationships in the video, but previous methods fall short in the alignment and fusion of video frame features, with problems such as inaccurate motion estimation and insufficient feature fusion. To address these problems, a video super-resolution model based on the Attention Fusion Network (AFN) was constructed using the back-projection principle combined with multiple attention mechanisms and fusion strategies. Firstly, at the feature extraction stage, to handle the multiple motions between the neighboring frames and the reference frame, a back-projection architecture was used to obtain error feedback on the motion information. Then, a temporal, spatial and channel attention fusion module was used to perform multi-dimensional feature mining and fusion. Finally, at the reconstruction stage, the resulting high-dimensional features were convolved to reconstruct high-resolution video frames. By learning different weights for the features within and between video frames, the correlations between video frames were fully explored, and an iterative network structure was adopted to process the extracted features progressively from coarse to fine. Experimental results on two public benchmark datasets show that AFN can effectively process videos with multiple motions and occlusions, and achieves significant improvements in quantitative indicators over some mainstream methods. For instance, on the 4× reconstruction task, the Peak Signal-to-Noise Ratio (PSNR) of the frames reconstructed by AFN is 13.2% higher than that of the Frame Recurrent Video Super-Resolution network (FRVSR) on the Vid4 dataset and 15.3% higher than that of the Video Super-Resolution network using Dynamic Upsampling Filter (VSR-DUF) on the SPMCS dataset.

Key words: super-resolution, attention mechanism, feature fusion, back-projection, video reconstruction
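
The PSNR figures quoted in the abstract can be illustrated with a minimal sketch. It uses the standard definition PSNR = 10·log10(MAX² / MSE) and assumes, as one plausible reading, that "X% higher" means a relative gain over the baseline PSNR in dB. The function names `psnr` and `relative_gain_percent`, the 8-bit peak value of 255, and the flat pixel lists are illustrative assumptions, not the authors' evaluation code.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Return the PSNR in dB between two equally sized pixel sequences.

    Assumes flat sequences of intensities with peak value `max_value`
    (255 for 8-bit images); this is an illustrative helper only.
    """
    if len(reference) != len(reconstructed) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_gain_percent(psnr_new, psnr_baseline):
    """Relative PSNR gain of a new method over a baseline, in percent.

    Assumption: this is how a statement such as "13.2% higher" is computed.
    """
    return 100.0 * (psnr_new - psnr_baseline) / psnr_baseline


if __name__ == "__main__":
    ground_truth = [10, 20, 30, 40]   # made-up pixel values
    estimate = [11, 21, 31, 41]       # every pixel off by one, so MSE = 1
    print(f"PSNR: {psnr(ground_truth, estimate):.2f} dB")
    print(f"Relative gain: {relative_gain_percent(33.0, 30.0):.1f}%")
```

The pixel values and the 33.0 dB and 30.0 dB inputs are made up for the demonstration and do not come from the paper.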


