Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (8): 2537-2545. DOI: 10.11772/j.issn.1001-9081.2024071058

• Artificial intelligence •

3D object detection algorithm based on multi-scale network and axial attention

Chengzhi YAN, Ying CHEN, Kai ZHONG, Han GAO

  1. School of Computer Science and Information Engineering, Shanghai Institute of Technology, Shanghai 201418, China
  • Received: 2024-07-29; Revised: 2024-09-29; Accepted: 2024-10-08; Online: 2024-11-19; Published: 2025-08-10
  • Contact: Ying CHEN
  • About the authors: YAN Chengzhi, born in 2000, M. S. candidate. His research interests include computer vision and object detection.
    ZHONG Kai, born in 2000, M. S. candidate. His research interests include computer vision and object detection.
    GAO Han, born in 1999, M. S. candidate. His research interests include computer vision and object detection.
  • Supported by:
    National Natural Science Foundation of China (61976140); Collaborative Innovation Fund Project of Shanghai Institute of Technology (XTCX2022-25)

基于多尺度网络与轴向注意力的3D目标检测算法 (3D object detection algorithm based on multi-scale network and axial attention)

颜承志 (YAN Chengzhi), 陈颖 (CHEN Ying), 钟凯 (ZHONG Kai), 高寒 (GAO Han)

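  1. School of Computer Science and Information Engineering, Shanghai Institute of Technology (上海应用技术大学 计算机科学与信息工程学院), Shanghai 201418
  • Corresponding author: 陈颖 (CHEN Ying)
  • About the authors: 颜承志 (YAN Chengzhi), born in 2000, male, a native of Zhuzhou, Hunan, M. S. candidate, CCF member. His research interests include computer vision and object detection.
    钟凯 (ZHONG Kai), born in 2000, male, a native of Xinyu, Jiangxi, M. S. candidate. His research interests include computer vision and object detection.
    高寒 (GAO Han), born in 1999, male, a native of Bozhou, Anhui, M. S. candidate. His research interests include computer vision and object detection.
  • Supported by:
    National Natural Science Foundation of China (61976140); Collaborative Innovation Fund Project of Shanghai Institute of Technology (XTCX2022-25)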

Abstract:

In 3D object detection, the detection accuracy for small targets such as pedestrians and cyclists remains low, which poses a challenge for the perception systems of autonomous vehicles. To estimate the state of the surrounding environment accurately and enhance driving safety, a 3D object detection algorithm based on a multi-scale network and axial attention was proposed by improving the Voxel R-CNN (Voxel Region-based Convolutional Neural Network) algorithm. Firstly, a multi-scale network and a Pixel-level Fusion Module (PFM) were constructed in the backbone network to obtain richer and more precise feature representations, thereby enhancing the robustness and generalization of the algorithm in complex scenarios. Secondly, an axial attention mechanism tailored to features with 3D spatial dimensions was designed and applied to the multi-scale pooled features of each Region of Interest (RoI), so as to capture both local and global features effectively while preserving essential information in the 3D spatial structure, thereby improving the accuracy and efficiency of the algorithm's object detection and classification. Finally, a Rotation-Decoupled Intersection over Union (RDIoU) method was introduced into the regression and classification branches, enabling the network to learn more precise bounding boxes and addressing the alignment issue between classification and regression. Experimental results on the KITTI public dataset show that the proposed algorithm achieves a mean Average Precision (mAP) of 62.25% for pedestrians and 79.36% for cyclists, improvements of 4.02 and 3.15 percentage points, respectively, over the baseline algorithm Voxel R-CNN, demonstrating the effectiveness of the improved algorithm in detecting hard-to-perceive objects.

Key words: 3D object detection, multi-scale network, feature fusion, axial attention, loss function
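
The axial attention step named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the layer design, so the single-head form, the residual connection, the axis order, the grid size and the function names `attend_along_axis` and `axial_attention_3d` are all assumptions made for illustration. The sketch only shows the general idea of axial attention, which is to run self-attention along one spatial axis of a 3D feature grid at a time instead of over all grid cells jointly.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend_along_axis(feats, axis, w_q, w_k, w_v):
    """Single-head self-attention restricted to one axis of a (D, H, W, C) grid.

    Hypothetical helper: every 1D line of cells along `axis` attends only
    to the other cells on the same line.
    """
    # Bring the chosen spatial axis next to the channel axis: (..., L, C).
    x = np.moveaxis(feats, axis, -2)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Scaled dot-product scores between cells on the same line: (..., L, L).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    out = softmax(scores, axis=-1) @ v
    # Restore the original (D, H, W, C) layout.
    return np.moveaxis(out, -2, axis)


def axial_attention_3d(feats, params):
    """Apply attention along depth, height and width in turn, with residuals.

    `params` holds one (w_q, w_k, w_v) triple per spatial axis (assumed layout).
    """
    out = feats
    for axis, (w_q, w_k, w_v) in zip((0, 1, 2), params):
        out = out + attend_along_axis(out, axis, w_q, w_k, w_v)
    return out


# Illustrative sizes only: a 6 x 6 x 6 RoI grid with 8 channels.
rng = np.random.default_rng(0)
roi_feats = rng.normal(size=(6, 6, 6, 8))
params = [tuple(0.1 * rng.normal(size=(8, 8)) for _ in range(3)) for _ in range(3)]
print(axial_attention_3d(roi_feats, params).shape)  # (6, 6, 6, 8)
```

For a grid with N = D·H·W cells, the three per-axis passes compute on the order of N·(D + H + W) attention scores, compared with N² for attention over all cells at once, while the grid layout of the features is kept throughout.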

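Abstract (translated from the Chinese version):

In 3D object detection, the detection accuracy for small objects such as pedestrians and cyclists is low, which is a challenging problem for the perception systems of autonomous vehicles. To estimate the state of the surrounding environment accurately and thereby improve driving safety, the Voxel R-CNN (Voxel Region-based Convolutional Neural Network) algorithm was improved, and a 3D object detection algorithm based on a multi-scale network and axial attention was proposed. Firstly, a multi-scale network and a Pixel-level Fusion Module (PFM) were constructed in the backbone network to obtain richer and more precise feature representations, thereby enhancing the robustness and generalization ability of the algorithm in complex scenes. Secondly, an axial attention suited to features with 3D spatial dimensions was designed and applied to the multi-scale pooled features of the Region of Interest (RoI), so as to capture local and global features effectively while preserving the important information in the 3D spatial structure, thereby improving the accuracy and efficiency of the algorithm's object detection and classification. Finally, a Rotation-Decoupled Intersection over Union (RDIoU) method was incorporated into the regression and classification branches, enabling the network to learn more precise bounding boxes and resolving the alignment problem between classification and regression. Experimental results on the KITTI public dataset show that the proposed algorithm reaches a mean Average Precision (mAP) of 62.25% for pedestrians and 79.36% for cyclists, which is 4.02 and 3.15 percentage points higher, respectively, than the baseline algorithm Voxel R-CNN, showing the effectiveness of the improved algorithm in detecting hard-to-perceive objects.

Key words: 3D object detection, multi-scale network, feature fusion, axial attention, loss function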

CLC Number: