In 3D object detection, the detection accuracy for small targets such as pedestrians and cyclists remains low, posing a challenge to the perception systems of autonomous vehicles. To estimate the state of the surrounding environment accurately and enhance driving safety, a 3D object detection algorithm based on a multi-scale network and axial attention was proposed by improving the Voxel R-CNN (Voxel Region-based Convolutional Neural Network) algorithm. Firstly, a multi-scale network with a Pixel-level Fusion Module (PFM) was constructed in the backbone network to obtain richer and more precise feature representations, thereby enhancing the robustness and generalization of the algorithm in complex scenarios. Secondly, an axial attention mechanism tailored to features in the 3D spatial dimensions was designed and applied to the multi-scale Region of Interest (RoI) pooling features, so as to capture both local and global features effectively while preserving the essential information of the 3D spatial structure, thereby improving the accuracy and efficiency of object detection and classification. Finally, a Rotation-Decoupled Intersection over Union (RDIoU) method was introduced into the regression and classification branches, enabling the network to learn more precise bounding boxes and alleviating the misalignment between classification and regression. Experimental results on the KITTI public dataset show that the proposed algorithm achieves a mean Average Precision (mAP) of 62.25% for pedestrians and 79.36% for cyclists, improvements of 4.02 and 3.15 percentage points respectively over the baseline Voxel R-CNN, demonstrating the effectiveness of the improved algorithm in detecting hard-to-perceive objects.
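The core idea of axial attention over 3D features can be sketched as follows: instead of attending over all D×H×W voxel positions at once (quadratic in DHW), self-attention is applied along one spatial axis at a time, reducing the cost while still propagating information globally through successive axes. This is a minimal NumPy illustration of that general principle, not the paper's implementation; the function names, the identity Q/K/V projections, and the residual connections are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention_1d(x, axis):
    """Self-attention along a single spatial axis of a (D, H, W, C) feature map.

    Identity Q/K/V projections keep the sketch minimal; a real module would
    use learned linear projections. (Illustrative assumption.)
    """
    x = np.moveaxis(x, axis, -2)                       # (..., L, C)
    c = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(c)   # (..., L, L)
    out = softmax(scores, axis=-1) @ x                 # (..., L, C)
    return np.moveaxis(out, -2, axis)

def axial_attention_3d(x):
    # attend sequentially over the depth, height, and width axes,
    # with a residual connection after each pass
    for axis in (0, 1, 2):
        x = x + axial_attention_1d(x, axis)
    return x

feat = np.random.rand(4, 5, 6, 16)   # toy (D, H, W, C) voxel feature volume
out = axial_attention_3d(feat)
print(out.shape)                     # (4, 5, 6, 16) — shape is preserved
```

Attending per axis costs O(DHW·(D+H+W)) score entries instead of O((DHW)²), which is what makes attention affordable on 3D voxel grids such as RoI pooling features.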