Journal of Computer Applications (《计算机应用》), sole official website

Dynamic dilated hierarchical attention network model based on an improved Lite-Mono architecture

李光辉,屈立成   

  1. Chang'an University
  • Received: 2025-07-24 Revised: 2025-09-25 Online: 2025-11-05 Published: 2025-11-05
  • Corresponding author: 李光辉

Dynamic dilated convolution and hierarchical local-global attention model based on improved Lite-Mono architecture

  • Received:2025-07-24 Revised:2025-09-25 Online:2025-11-05 Published:2025-11-05

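Abstract: Monocular depth estimation is a core technology for 3D environment perception and has important applications in autonomous driving, robot navigation, and related fields. Within this area, Lite-Mono is a representative lightweight monocular depth estimation model that achieves strong performance while remaining efficient by using dilated convolutions and a local-global attention mechanism. The Lite-Mono network currently faces two major bottlenecks. First, its dilated convolutions are restricted to fixed dilation rates, which causes a severe mismatch between receptive field and object scale: small objects lose features because the dilation rate is too large, while large objects lack context because the receptive field is too small. Second, the local-global attention module has a computational complexity of O(H^2W^2C), which makes high-resolution inputs difficult to process and limits real-time use. To address these problems, a Dynamic Dilated convolution and Hierarchical Local-global attention (DDHL) network model is proposed on the basis of the Lite-Mono architecture. First, a dynamic dilated convolution module is proposed that adjusts the receptive field in real time through a fusion weight rate predictor and generates adaptive weights in combination with channel attention. Second, a hierarchical local-global attention module is proposed that reduces the complexity to O((HW/M)^2C). Finally, a learnable global token is introduced to build cross-window constraints, helping the model understand and analyze visual content more accurately in complex scenes and improving its ability to perceive and model the scene as a whole. Experimental results show that the DDHL model makes substantial gains on key metrics on the Make3D dataset: the absolute relative error (Abs Rel) falls from 0.462 to 0.290 (a 37.2% reduction), and the squared relative error (Sq Rel) falls by 49.7%. The model achieves a good balance between accuracy and efficiency and has practical application value.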
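The complexity figures quoted in the abstract, O(H^2W^2C) for the original attention and O((HW/M)^2C) for the hierarchical variant, can be compared with a short leading-term calculation. The sketch below is only illustrative: the abstract does not define M, so it is treated here as an assumed token-reduction factor, and the feature-map size and channel count in the example are hypothetical values, not figures from the paper.

```python
def attention_cost(height: int, width: int, channels: int, m: int = 1) -> float:
    """Leading-term cost of pairwise attention over (H*W)/M tokens.

    Constants and lower-order terms are ignored; `m` is an assumed
    token-reduction factor (m=1 corresponds to O(H^2 W^2 C)).
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    tokens = (height * width) / m
    return tokens ** 2 * channels


if __name__ == "__main__":
    # Hypothetical feature map: 40 x 128 positions, 48 channels.
    h, w, c = 40, 128, 48
    full = attention_cost(h, w, c)        # (HW)^2 * C
    reduced = attention_cost(h, w, c, 4)  # (HW/4)^2 * C
    print(f"full={full:.3e} reduced={reduced:.3e} ratio={full / reduced}")
```

Under this reading, the leading-term cost shrinks by a factor of M^2, which is why even a modest M matters for high-resolution inputs.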

Abstract: Monocular depth estimation is a core technology for 3D environmental perception and is of significant value in applications such as autonomous driving and robot navigation. As a representative lightweight model in this field, Lite-Mono achieves an excellent balance between performance and efficiency through the use of dilated convolutions and a local-global attention mechanism. However, the current Lite-Mono architecture has two major bottlenecks. First, the fixed dilation rate of its dilated convolutions leads to a mismatch between the receptive field and the target scale: small objects lose detailed features because the dilation rate is too large, while large objects lack contextual information because the receptive field is too small. Second, the O(H^2W^2C) computational complexity of the local-global attention module hinders the processing of high-resolution inputs and thus limits real-time application. To address these issues, an enhanced model based on Lite-Mono, DDHL (Dynamic Dilated convolution and Hierarchical Local-global attention), is proposed. It introduces a Dynamic Dilated Convolution module, which adjusts the receptive field in real time via a fusion weight rate predictor and generates adaptive weights by combining channel attention. Furthermore, a Hierarchical Local-Global Attention module is designed to reduce the computational complexity to O((HW/M)^2C). Additionally, a learnable global token is incorporated to establish cross-window dependencies. Experimental results on the Make3D dataset show that the DDHL model improves substantially on key metrics: the Absolute Relative Error (Abs Rel) is reduced from 0.462 to 0.290, a decrease of 37.2%, and the Squared Relative Error (Sq Rel) is reduced by 49.7%. The improved model achieves a better balance between accuracy and efficiency, demonstrating considerable practical value.
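For readers unfamiliar with the two reported metrics, the sketch below uses the definitions commonly adopted in the monocular depth estimation literature: Abs Rel is the mean of |d - d*| / d*, and Sq Rel is the mean of (d - d*)^2 / d*, where d is the predicted depth and d* the ground truth. The abstract does not spell out its evaluation protocol, so the valid-pixel rule (ground truth greater than zero) and the sample depths are assumptions for illustration. The last helper simply rechecks the 37.2% figure from the two Abs Rel values quoted above.

```python
from typing import List, Sequence, Tuple


def _valid_pairs(pred: Sequence[float], gt: Sequence[float]) -> List[Tuple[float, float]]:
    """Keep only pixels with positive ground-truth depth (assumed validity rule)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return pairs


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Absolute relative error: mean(|pred - gt| / gt)."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def sq_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Squared relative error: mean((pred - gt)^2 / gt)."""
    pairs = _valid_pairs(pred, gt)
    return sum((p - g) ** 2 / g for p, g in pairs) / len(pairs)


def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease of a metric from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    # Made-up depths in metres, for illustration only.
    gt = [2.0, 4.0, 10.0]
    pred = [2.2, 3.0, 10.0]
    print(f"Abs Rel: {abs_rel(pred, gt):.4f}")
    print(f"Sq Rel:  {sq_rel(pred, gt):.4f}")
    # Abs Rel values quoted in the abstract.
    print(f"Reduction: {relative_reduction(0.462, 0.290) * 100:.1f}%")
```

Both metrics are lower-is-better and normalise by the true depth, so errors on nearby surfaces weigh more heavily than equal errors far away.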

CLC Number: