Incorporating multimodal information to enhance knowledge graph link prediction has recently become a research hotspot. However, most existing methods rely on simple concatenation or attention mechanisms for multimodal feature fusion and ignore the correlation and semantic inconsistency between modalities, so they may fail to preserve modality-specific information and only inadequately exploit the complementary information across modalities. To address these issues, a multimodal knowledge graph link prediction model based on a cross-modal attention mechanism and contrastive learning, named FITILP (Fusing Image and Textual Information for Link Prediction), was proposed. First, pretrained models such as BERT (Bidirectional Encoder Representations from Transformers) and ResNet (Residual Network) were used to extract the textual and visual features of entities. Then, a Contrastive Learning (CL) approach was applied to reduce semantic inconsistencies across modalities, and a cross-modal attention module was designed to refine the attention parameters of textual features with image features, thereby strengthening the cross-modal correlation between text and image features. Translation-based models such as TransE (Translating Embeddings) and TransH (Translating on Hyperplanes) were employed to generate the graph-structural, visual, and textual features. Finally, the three types of features were fused to perform link prediction between entities. Experimental results on the DB15K dataset show that FITILP improves Mean Reciprocal Rank (MRR) by 6.6 percentage points over the single-modal baseline TransE, and achieves improvements of 3.95, 11.37, and 14.01 percentage points in Hits@1, Hits@10, and Hits@100, respectively. These results indicate that the proposed method outperforms the comparison baselines and effectively leverages multimodal information to improve prediction performance.
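The sketch below illustrates the overall pipeline in PyTorch: projecting pre-extracted textual and visual features, aligning them with an InfoNCE-style contrastive loss, refining text features through image-guided cross-modal attention, and fusing the result with structural embeddings scored in TransE fashion. It is a minimal sketch under stated assumptions, not the authors' implementation: the module names, dimensions, single-head usage of the attention layer, and the additive fusion are illustrative choices, and the text/image features are assumed to come from frozen BERT/ResNet encoders.

```python
# Minimal, illustrative sketch of a FITILP-style fusion pipeline (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    def __init__(self, txt_dim=768, img_dim=2048, dim=256, temperature=0.07):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, dim)   # projects BERT sentence features (assumed 768-d)
        self.img_proj = nn.Linear(img_dim, dim)   # projects ResNet pooled features (assumed 2048-d)
        # Cross-modal attention: image features act as queries that re-weight
        # (refine) the attention over the projected text features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.temperature = temperature

    def contrastive_loss(self, t, v):
        # InfoNCE-style loss pulling matched text/image pairs together,
        # reducing cross-modal semantic inconsistency (assumed CL formulation).
        t = F.normalize(t, dim=-1)
        v = F.normalize(v, dim=-1)
        logits = t @ v.t() / self.temperature
        labels = torch.arange(t.size(0), device=t.device)
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

    def forward(self, txt_feat, img_feat, struct_emb):
        t = self.txt_proj(txt_feat)               # (B, D) projected text features
        v = self.img_proj(img_feat)               # (B, D) projected image features
        cl_loss = self.contrastive_loss(t, v)
        # Image-guided refinement of text features via cross-modal attention.
        refined_t, _ = self.cross_attn(v.unsqueeze(1), t.unsqueeze(1), t.unsqueeze(1))
        refined_t = refined_t.squeeze(1)
        # Fuse structural (e.g. TransE), visual, and textual features (additive fusion assumed).
        fused = struct_emb + refined_t + v
        return fused, cl_loss


def transe_score(h, r, t):
    # TransE plausibility score: lower ||h + r - t|| means a more likely triple.
    return torch.norm(h + r - t, p=1, dim=-1)


if __name__ == "__main__":
    B, D = 4, 256
    model = CrossModalFusion()
    txt = torch.randn(B, 768)      # stand-in for BERT [CLS] features
    img = torch.randn(B, 2048)     # stand-in for ResNet pooled features
    struct = torch.randn(B, D)     # stand-in for pretrained TransE entity embeddings
    fused, cl_loss = model(txt, img, struct)
    rel, tail = torch.randn(B, D), torch.randn(B, D)
    print(fused.shape, cl_loss.item(), transe_score(fused, rel, tail).shape)
```

In such a setup, the contrastive loss would typically be added to the link prediction loss during training, with the fused entity representation replacing the purely structural embedding in the TransE/TransH scoring function.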