Journal of Computer Applications, 2024, Vol. 44, Issue (12): 3907-3914. DOI: 10.11772/j.issn.1001-9081.2023111713

• Multimedia computing and computer simulation •

Self-supervised monocular depth estimation using multi-frame sequence images

Wei XIONG1, Yibo CHEN1, Lizhen ZHANG1, Qian YANG1, Qin ZOU2

  1. School of Electrical and Electronic Engineering, Hubei University of Technology, Wuhan, Hubei 430068, China
  2. School of Computer Science, Wuhan University, Wuhan, Hubei 430072, China
  • Received: 2023-12-09 Revised: 2024-03-20 Accepted: 2024-05-24 Online: 2024-07-25 Published: 2024-12-10
  • Contact: Wei XIONG, Qin ZOU
  • About author: CHEN Yibo, born in 1998, M. S. candidate. His research interests include monocular depth estimation.
    ZHANG Lizhen, born in 1999, M. S. candidate. Her research interests include medical image segmentation.
    YANG Qian, born in 1997, M. S. candidate. Her research interests include transmission line anomaly detection.
  • Supported by:
    National Natural Science Foundation of China (62202148)

Self-supervised monocular depth estimation using multi-frame sequence images

Wei XIONG (熊炜)1, Yibo CHEN (陈奕博)1, Lizhen ZHANG (张丽真)1, Qian YANG (杨茜)1, Qin ZOU (邹勤)2

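  1. School of Electrical and Electronic Engineering, Hubei University of Technology, Wuhan 430068
  2. School of Computer Science, Wuhan University, Wuhan 430072
  • Corresponding authors: Wei XIONG, Qin ZOU
  • About the authors: XIONG Wei, born in 1976 in Yichang, Hubei, male, associate professor, Ph. D. His research interests include computer vision and pattern recognition.
    CHEN Yibo, born in 1998 in Wuhan, Hubei, male, M. S. candidate. His research interests include monocular depth estimation.
    ZHANG Lizhen, born in 1999 in Xi'an, Shaanxi, female, M. S. candidate. Her research interests include medical image segmentation.
    YANG Qian, born in 1997 in Wuhan, Hubei, female, M. S. candidate. Her research interests include transmission line anomaly detection.
  • Funding:
    National Natural Science Foundation of China (62202148)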

Abstract:

Multi-frame self-supervised monocular depth estimation constructs a Cost Volume (CV) from the relationship between the current frame and the previous frame and uses it as an additional input source for the monocular depth estimation network. This gives a more accurate description of the temporal and spatial structure of scene videos. However, the cost volume becomes unreliable when the scene contains dynamic objects or textureless regions, and overreliance on this unreliable information lowers depth estimation accuracy. To address the unreliable information in the cost volume, a multi-frame fusion module was designed that dynamically reduces the weights of unreliable information sources and so mitigates their impact on the network. In addition, to limit the negative effect of unreliable information sources in the cost volume on network training, a network was designed to guide the training of the depth estimation network, preventing it from depending too heavily on unreliable information. On the KITTI dataset, the proposed method reduces absolute relative error, squared relative error, and Root Mean Square Error (RMSE) by 0.015, 0.094, and 0.200, respectively, compared with the baseline method Lite-Mono. Compared with similar methods, the proposed method is both more accurate and less demanding of computational resources. The proposed network structure makes full use of the advantages of multi-frame training while avoiding its main drawback (i.e., the influence of cost volume uncertainty on the network), and effectively improves model accuracy.
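
The three error measures reported above are not defined in the abstract. The sketch below uses the definitions commonly applied in KITTI depth evaluation, which is an assumption here rather than something the abstract states; the function name `depth_errors` and the sample values are hypothetical and are not results from the paper.

```python
import math


def depth_errors(gt, pred):
    """Return (abs_rel, sq_rel, rmse) for paired depths.

    Assumed definitions (common in depth-estimation evaluation):
      abs_rel = mean(|gt - pred| / gt)
      sq_rel  = mean((gt - pred)^2 / gt)
      rmse    = sqrt(mean((gt - pred)^2))
    """
    if not gt or len(gt) != len(pred):
        raise ValueError("gt and pred must be non-empty and of equal length")
    n = len(gt)
    abs_rel = sum(abs(g - p) / g for g, p in zip(gt, pred)) / n
    sq_rel = sum((g - p) ** 2 / g for g, p in zip(gt, pred)) / n
    rmse = math.sqrt(sum((g - p) ** 2 for g, p in zip(gt, pred)) / n)
    return abs_rel, sq_rel, rmse


# Toy depths in metres (illustrative values only).
gt = [2.0, 4.0, 8.0]
pred = [1.0, 4.0, 10.0]
abs_rel, sq_rel, rmse = depth_errors(gt, pred)
print(abs_rel, sq_rel, rmse)
```

All three measures are errors, so lower is better, which is why the abstract reports decreases relative to Lite-Mono.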

Key words: self-supervised monocular depth estimation, multi-view stereo, monocular video, cost volume

Abstract:

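Multi-frame self-supervised monocular depth estimation builds a cost volume (CV) from the relationship between the current frame and the previous frame. The CV can serve as an additional input source for the monocular depth estimation network and describes the temporal relationships and spatial structure of scene videos more accurately. However, when the scene contains dynamic objects or textureless regions, the CV becomes an unreliable source of information, and when the monocular depth estimation network relies too heavily on the unreliable information in the CV, depth estimation accuracy drops. To address this, a multi-frame fusion module is designed that dynamically lowers the weights of unreliable information sources and reduces their influence on the network. To counter the negative effect of unreliable information sources in the CV on network training, a network that guides the training of the depth estimation network is also designed, preventing the depth estimation network from depending excessively on unreliable information. The proposed method performs strongly on the KITTI dataset: compared with the baseline method Lite-Mono, its absolute relative error, squared relative error, and root mean square error (RMSE) decrease by 0.015, 0.094, and 0.200, respectively; compared with similar methods, it is more accurate and uses fewer computing resources. The proposed network structure makes full use of the advantages of multi-frame training while avoiding its drawback (namely, the influence of CV uncertainty on the network), and effectively improves model accuracy.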
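
The abstract does not describe how the fusion module computes or applies its weights. As a generic illustration of the idea of down-weighting an unreliable source, and not the authors' module, the sketch below blends a single-frame cue with a multi-frame (cost volume) cue using a per-pixel confidence in [0, 1]. The function `fuse_features`, its arguments, and the values are all hypothetical.

```python
def fuse_features(mono_feat, multi_feat, confidence):
    """Blend two per-pixel cues with a confidence weight.

    confidence near 1 trusts the multi-frame (cost volume) cue;
    confidence near 0 falls back to the single-frame cue.
    This is a plain convex combination, assumed for illustration only.
    """
    if not (len(mono_feat) == len(multi_feat) == len(confidence)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for mono, multi, w in zip(mono_feat, multi_feat, confidence):
        w = min(1.0, max(0.0, w))  # clamp the weight to [0, 1]
        fused.append(w * multi + (1.0 - w) * mono)
    return fused


# Three pixels: static textured, ambiguous, and a moving object (toy values).
mono = [1.0, 1.0, 1.0]
multi = [3.0, 3.0, 3.0]
conf = [1.0, 0.5, 0.0]
fused = fuse_features(mono, multi, conf)
print(fused)
```

In a learned module the confidence would be predicted by the network rather than supplied by hand; the point of the sketch is only that a low weight on a moving or textureless pixel lets the single-frame cue dominate there.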

Keywords: self-supervised monocular depth estimation, multi-view stereo, monocular video, cost volume

CLC Number: