Current deep learning-based methods for crop disease recognition rely on specific crop disease image datasets for image representation learning and do not consider how text features can assist image feature learning. To improve the model’s feature extraction and disease recognition capabilities on crop disease images, a Crop Disease Recognition method through multi-modal data fusion based on Contrastive Language-Image Pre-training (CDR-CLIP) is proposed. First, high-quality image-text pair datasets for disease recognition are constructed so that textual information enhances image feature representation. Then, a multi-modal fusion strategy is applied to integrate text and image features, strengthening the model’s ability to distinguish diseases. Finally, specialized pre-training and fine-tuning strategies are designed to optimize the model’s performance on specific crop disease recognition tasks. Experimental results show that CDR-CLIP achieves disease recognition accuracies of 99.31% and 87.66%, with F1 values of 99.04% and 87.56%, on the PlantVillage and AI Challenger 2018 crop disease datasets, respectively. On the PlantDoc dataset, CDR-CLIP achieves a mean Average Precision (mAP@0.5) of 51.10%. These results indicate a strong performance advantage for CDR-CLIP.
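
For readers unfamiliar with Contrastive Language-Image Pre-training, the following is a minimal sketch of the standard symmetric contrastive objective that CLIP-style models use to align paired image and text embeddings. It is an illustration only: the abstract does not specify the loss, fusion strategy, or hyperparameters of CDR-CLIP, so the function name `clip_contrastive_loss`, the temperature value of 0.07, and the toy embeddings are all assumptions rather than the authors' implementation.

```python
import numpy as np


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss (generic CLIP-style sketch).

    Row i of image_emb and row i of text_emb are assumed to form a matched
    image-text pair; all other rows in the batch act as negatives.
    The temperature default is an assumed value, not taken from the paper.
    """
    # L2-normalize so the dot product below is a cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = img @ txt.T / temperature

    def matched_pair_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i (the matched pair).
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    loss_i2t = matched_pair_cross_entropy(logits)
    loss_t2i = matched_pair_cross_entropy(logits.T)
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Toy embeddings: four orthogonal "image" vectors and their "text" twins.
    images = np.eye(4)
    aligned_text = np.eye(4)
    misaligned_text = np.roll(np.eye(4), 1, axis=0)  # every pair mismatched

    print("aligned loss:   ", clip_contrastive_loss(images, aligned_text))
    print("misaligned loss:", clip_contrastive_loss(images, misaligned_text))
```

In this sketch the loss is small when each image embedding is closest to its own text embedding and large when the pairs are mismatched, which is the property that lets textual descriptions guide image feature learning.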