Multi-modal rumor detection on social media faces challenges such as weak cross-modal feature correlation and insufficient intrinsic representation of the data. To address these issues, a rumor detection method based on a cross-modal attention mechanism and contrastive learning was proposed. In this method, fine-grained textual and visual features were extracted by a multi-modal feature module; a cross-modal co-attention mechanism and discriminative learning were used to strengthen inter-modal correlation; complex semantic contexts were captured with multi-head self-attention; and a contrastive learning module was introduced to optimize features without manual supervision. Experimental results on the public Twitter-16 and Weibo datasets show that the accuracy of the proposed method is 5.47 and 4.44 percentage points higher, respectively, than that of the best existing model, MMFN (Multi-Modal Fusion Network), verifying the key roles of fine-grained feature mining and cross-modal similarity modeling in detection performance. These results indicate that deeply analyzing multi-modal content differences and strengthening cross-modal association mechanisms can effectively improve the recognition accuracy of social media rumors.
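The two core mechanisms named above can be illustrated with a minimal NumPy sketch: cross-modal attention in which text tokens attend to image regions, and an InfoNCE-style contrastive loss over matched text/image pairs. All function names, dimensions, and the temperature value here are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, image):
    """Text tokens (queries) attend to image regions (keys/values),
    yielding image-aware text features. Shapes are illustrative."""
    d = text.shape[-1]
    scores = text @ image.T / np.sqrt(d)   # (n_text, n_img) similarity
    weights = softmax(scores, axis=-1)     # attention over image regions
    return weights @ image                 # (n_text, d) fused features

def info_nce(text_vec, image_vec, temperature=0.1):
    """InfoNCE contrastive loss (text-to-image direction): matched
    pairs on the diagonal are pulled together, others pushed apart."""
    t = text_vec / np.linalg.norm(text_vec, axis=1, keepdims=True)
    v = image_vec / np.linalg.norm(image_vec, axis=1, keepdims=True)
    logits = t @ v.T / temperature         # (batch, batch) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(t))
    return -log_probs[idx, idx].mean()     # negative log-likelihood of pairs

rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))         # 4 hypothetical text tokens, dim 8
image = rng.standard_normal((6, 8))        # 6 hypothetical image regions
fused = cross_modal_attention(text, image)
loss = info_nce(rng.standard_normal((3, 8)), rng.standard_normal((3, 8)))
print(fused.shape)                          # (4, 8)
```

In a full model of this kind, the fused features would typically pass through multi-head self-attention and a classifier, while the contrastive term is added to the detection loss during training.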