Bridging the semantic gap between visual images and text-based questions is key to improving the reasoning accuracy of Visual Question Answering (VQA) models. However, most existing models rely on extracting low-level image features and applying attention mechanisms to reason about answers, while ignoring the important role that high-level image semantic features, such as relationship features and attribute features, play in visual reasoning. To address this problem, a VQA model based on multi-semantic association and fusion was proposed to establish semantic associations between questions and images. Firstly, based on a scene graph generation framework, multiple types of semantic features were extracted from images and refined as the feature input of the VQA model, so as to fully exploit the information in visual scenes. Secondly, to enhance the semantic value of the image features, an information filter was designed to remove noise and redundant information from them. Finally, a multi-layer attention fusion and reasoning module was designed to fuse each type of image semantics with the question features and to strengthen the semantic association between the important image regions and the questions. Experimental results show that, compared with the Bilinear Attention Network (BAN) and Coarse-to-Fine Reasoning (CFR) models, the proposed model improves accuracy on the VQA2.0 test set by 2.9 and 0.4 percentage points respectively, and on the GQA test set by 17.2 and 0.3 percentage points respectively, demonstrating that it can better understand the semantics of image scenes and answer compositional visual questions.
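
To make the fusion step concrete, the following is a minimal PyTorch sketch of how one stream of image semantic features (for example, object, relationship, or attribute features) could be fused with question features through attention, and how several such streams might then be combined for answer prediction. The class SemanticAttentionFusion, all dimensions, and the answer-head size are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (PyTorch) of attention-based fusion between question features and
# one stream of image semantic features. All names and dimensions are illustrative
# assumptions, not the implementation described in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticAttentionFusion(nn.Module):
    """Attend over one set of image semantic features conditioned on the question,
    then fuse the attended visual vector with the question vector."""

    def __init__(self, vis_dim: int, q_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.fuse = nn.Linear(hidden_dim * 2, hidden_dim)

    def forward(self, vis_feats: torch.Tensor, q_feat: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, num_regions, vis_dim); q_feat: (batch, q_dim)
        v = self.vis_proj(vis_feats)                  # (B, N, H)
        q = self.q_proj(q_feat).unsqueeze(1)          # (B, 1, H)
        scores = self.att_score(torch.tanh(v + q))    # (B, N, 1)
        alpha = F.softmax(scores, dim=1)              # attention over regions
        attended = (alpha * v).sum(dim=1)             # (B, H)
        # Joint question-vision representation for this semantic stream
        return torch.relu(self.fuse(torch.cat([attended, q.squeeze(1)], dim=-1)))


if __name__ == "__main__":
    B, N = 2, 36
    obj_feats = torch.randn(B, N, 2048)   # e.g., object features
    rel_feats = torch.randn(B, N, 1024)   # e.g., relationship features
    q_feat = torch.randn(B, 1024)         # question encoding

    obj_branch = SemanticAttentionFusion(2048, 1024)
    rel_branch = SemanticAttentionFusion(1024, 1024)
    # Fuse each semantic stream with the question separately, then combine
    joint = torch.cat([obj_branch(obj_feats, q_feat),
                       rel_branch(rel_feats, q_feat)], dim=-1)
    logits = nn.Linear(joint.size(-1), 3129)(joint)   # answer classifier head
    print(logits.shape)                               # torch.Size([2, 3129])
```

In this sketch each semantic stream gets its own attention branch, mirroring the idea of fusing multiple image semantics with the question features separately before reasoning over the combined representation.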