The key to cross-modal image-text retrieval is effectively capturing the semantic correlation between images and text. Most existing methods learn either the global semantic correlation between image region features and text features or the local semantic correlation between inter-modality objects, while ignoring the correlation between intra-modality object relationships and inter-modality object relationships. To solve this problem, a Cross-Modal Tensor Fusion Network based on Semantic Relation Graph (CMTFN-SRG) was proposed for image-text retrieval. Firstly, the relationships among image regions and among text words were generated by a Graph Convolutional Network (GCN) and a Bidirectional Gated Recurrent Unit (Bi-GRU), respectively. Then, a tensor fusion network was used to match the learned semantic relation graph of image regions with that of text words, so as to learn the fine-grained semantic correlation between the data of the two modalities. At the same time, a Gated Recurrent Unit (GRU) was used to learn the global features of the image, and these were matched with the global text features to capture the inter-modality global semantic correlation. The proposed method was compared with the Multi-Modality Cross Attention (MMCA) method on the benchmark datasets Flickr30K and MS-COCO. Experimental results show that the proposed method improves the Recall@1 of the text-to-image retrieval task by 2.6%, 9.0% and 4.1% on the test datasets Flickr30K, MS-COCO1K and MS-COCO5K, respectively, and improves the mean Recall (mR) by 0.4, 1.3 and 0.1 percentage points, respectively. It can be seen that the proposed method can effectively improve the precision of image-text retrieval.
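To make the described pipeline concrete, below is a minimal PyTorch sketch of the three components named in the abstract: a GCN over image regions, a Bi-GRU over text words, a bilinear (tensor) fusion of the two relation representations, and a GRU-based global image feature matched against the global text feature. All module names, dimensions, the similarity-based region adjacency, and the low-rank form of the fusion are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of a CMTFN-SRG-style matching model; details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionGCN(nn.Module):
    """One GCN layer over image regions; the adjacency is built from
    region-feature cosine similarity (an assumption for this sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, regions):                  # regions: (B, R, D)
        norm = F.normalize(regions, dim=-1)
        adj = F.softmax(norm @ norm.transpose(1, 2), dim=-1)  # row-normalized relation graph
        return F.relu(self.proj(adj @ regions))               # relation-aware region features


class CMTFNSRG(nn.Module):
    def __init__(self, vocab_size, dim=512, rank=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.word_gru = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
        self.region_gcn = RegionGCN(dim)
        self.global_gru = nn.GRU(dim, dim, batch_first=True)
        # Low-rank bilinear (tensor) fusion of region and word relation features.
        self.img_factor = nn.Linear(dim, rank)
        self.txt_factor = nn.Linear(dim, rank)
        self.fuse_score = nn.Linear(rank, 1)

    def forward(self, regions, words):            # regions: (B, R, D), words: (B, T) long
        word_feats, _ = self.word_gru(self.embed(words))   # (B, T, D) word relation features
        region_feats = self.region_gcn(regions)            # (B, R, D) region relation features

        # Fine-grained correlation: fuse every region node with every word node.
        fused = self.img_factor(region_feats).unsqueeze(2) * \
                self.txt_factor(word_feats).unsqueeze(1)    # (B, R, T, rank)
        local_score = self.fuse_score(fused).squeeze(-1).mean(dim=(1, 2))  # (B,)

        # Global correlation: GRU-pooled image feature vs. mean-pooled text feature.
        _, img_global = self.global_gru(regions)            # (1, B, D)
        global_score = F.cosine_similarity(img_global.squeeze(0),
                                           word_feats.mean(dim=1), dim=-1)
        return local_score + global_score                   # overall matching score


# Toy usage: matching scores for a batch of 2 image-text pairs.
model = CMTFNSRG(vocab_size=1000)
scores = model(torch.randn(2, 36, 512), torch.randint(0, 1000, (2, 12)))
print(scores.shape)  # torch.Size([2])
```

In this sketch the local and global scores are simply summed; how the two correlation levels are actually combined and trained (e.g., with a ranking loss over mismatched pairs) is not specified in the abstract and would follow the paper's full method.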