High-voltage transmission lines are the core carrier of power transmission and distribution, and their operating condition directly affects the safety of the power grid. To address the low efficiency and high missed-detection rate of traditional manual inspection, a lightweight transmission line defect detection method based on a two-stage multi-modal attention mechanism and dynamic feature decoupling was proposed. In the first stage, key components were accurately localized by an improved lightweight detection network, Light-YOLO. In the second stage, a dual-branch defect detection network based on contrastive learning, Dual-DifferNet, was built to classify and identify defects precisely. In Light-YOLO, a hybrid structure combining the hierarchical Separable Vision Transformer (SepViT) with a deep Deformable Convolutional Network (DCN) was introduced. By alternately stacking local-perception convolutional layers and global-attention Transformer blocks, the model’s ability to capture long-range dependencies was enhanced while the computational cost was reduced, which effectively improved the detection accuracy for small targets such as insulators and conductor splices. For defect classification, Dual-DifferNet embedded a Spatial-Channel Dual Attention (SCDA) module in each of its two branches and used a cross-attention mechanism to promote interaction between the dual-modal features, thereby improving the robustness and generalization capability of defect identification. Experimental results show that the proposed method achieves a mean Average Precision (mAP@50) of 96.9%, which is 16.1 percentage points higher than that of the baseline model YOLOv8, while reducing floating-point operations by 56.73%, verifying the method’s high detection accuracy, computational efficiency, and deployment potential.
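The cross-attention interaction between the two branches of Dual-DifferNet can be illustrated with a minimal sketch. The abstract does not specify the exact formulation, so the code below assumes plain scaled dot-product attention without learned projections, in which tokens of one branch act as queries and tokens of the other branch act as both keys and values. The function names and the toy feature values are hypothetical and are not taken from the proposed network.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product cross attention (assumed form, no learned weights).

    queries:     tokens from one branch, each a list of d floats
    keys_values: tokens from the other branch, used as keys and values
    Returns (attended tokens, attention weights).
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query token to every token of the other branch.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in keys_values]
        w = softmax(scores)
        # Weighted sum of the other branch's tokens, channel by channel.
        out = [sum(wj * kv[c] for wj, kv in zip(w, keys_values))
               for c in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy features: 2 tokens per branch, 2 channels per token.
branch_a = [[1.0, 0.0], [0.0, 1.0]]
branch_b = [[2.0, 0.0], [0.0, 2.0]]

# Each branch queries the other, so information flows in both directions.
a_from_b, attn_ab = cross_attention(branch_a, branch_b)
b_from_a, attn_ba = cross_attention(branch_b, branch_a)
```

In this sketch each branch attends to the other, so every output token is a convex combination of the other branch's tokens. A practical implementation would add learned query, key, and value projections, multiple heads, and residual connections.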