Image captioning technology helps computers better understand image content and enables cross-modal interaction. To address the issues of incomplete extraction of multi-granularity image features and insufficient understanding of image-text correlation in Chinese image captioning tasks, a method was proposed for extracting multi-level visual and semantic features of images and dynamically integrating them during the decoding process. Firstly, multi-level visual features were extracted by the encoder, and multi-granularity features were obtained through an auxiliary guidance module of the local image feature extractor. Then, a text-image interaction module was designed to dynamically focus on the semantic associations between visual and textual information. Concurrently, a dynamic feature fusion decoder was designed to perform closed-loop dynamic fusion and decoding of features with adaptive cross-modal weights, enhancing information integrity while maintaining semantic relevance. Finally, coherent Chinese descriptive sentences were generated. The method's effectiveness was evaluated using the BLEU-n, ROUGE-L, METEOR, and CIDEr metrics, with comparisons against eight existing approaches. Experimental results demonstrate consistent improvements across all semantic relevance evaluation metrics. Specifically, compared with the baseline model NIC (Neural Image Caption), the proposed method improves BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, and CIDEr by 5.62%, 7.25%, 8.78%, 10.85%, 14.06%, 5.14%, and 15.16%, respectively, confirming its superior accuracy.
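The abstract does not give implementation details for the adaptive cross-modal weighting in the dynamic feature fusion decoder; the following is only a minimal, illustrative sketch of one common way such gated fusion is realized, assuming a learned per-dimension gate between projected visual and textual features. All module names, dimensions, and the gating formulation are assumptions for illustration, not the paper's actual design.

```python
# Hypothetical sketch of adaptive cross-modal feature fusion (illustrative only,
# not the paper's implementation): a learned gate weights visual and textual
# features before they are passed on to the caption decoder.
import torch
import torch.nn as nn


class GatedCrossModalFusion(nn.Module):
    """Fuse visual and textual features with an adaptive, per-dimension gate."""

    def __init__(self, visual_dim: int, text_dim: int, hidden_dim: int):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)  # project image features
        self.text_proj = nn.Linear(text_dim, hidden_dim)      # project word/semantic features
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)     # adaptive cross-modal weight

    def forward(self, visual_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        v = torch.tanh(self.visual_proj(visual_feat))
        t = torch.tanh(self.text_proj(text_feat))
        # Gate in [0, 1] decides, per dimension, how much each modality contributes.
        g = torch.sigmoid(self.gate(torch.cat([v, t], dim=-1)))
        return g * v + (1.0 - g) * t


# Example usage: fuse a 2048-d global visual feature with a 512-d decoder state.
fusion = GatedCrossModalFusion(visual_dim=2048, text_dim=512, hidden_dim=512)
fused = fusion(torch.randn(8, 2048), torch.randn(8, 512))  # shape: (8, 512)
```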