In response to the difficulty of extracting features from lesion areas in pneumonia X-ray images and the insufficient degree of lightweighting in existing models, a Feature Fusion MV2-Transformer (FFMV2-Transformer) pneumonia X-ray image classification model was proposed. Firstly, the lightweight network MobileNetV2 (Mobile Network Version 2) was employed as the backbone network, and the Coordinate Attention (CA) mechanism was embedded in the inverted residual bottleneck blocks, so that positional information was embedded into channel information to enhance the model's ability to extract features from lesion areas. Secondly, a Local and Global Feature Fusion Module (LGFFM) was proposed to combine local features extracted by convolutional layers with global features captured by the Transformer, enabling the model to capture detailed and holistic information about lesion areas simultaneously and further improving its semantic feature extraction capability. Finally, a Cross-layer Feature Fusion Module (CFFM) was proposed to combine the spatial information of shallow features enhanced by the spatial attention mechanism with the semantic information of deep features enhanced by the channel attention mechanism, thereby obtaining rich contextual information. To verify the model's effectiveness, ablation and comparison experiments were conducted on a pneumonia X-ray dataset. The results show that, compared with the MobileViT (Mobile Vision Transformer) model, the FFMV2-Transformer model achieves improvements of 1.09, 0.31, 1.91, 1.08 and 0.40 percentage points in accuracy, precision, recall, F1-score and AUC (Area Under ROC (Receiver Operating Characteristic) Curve), respectively. These results indicate that the FFMV2-Transformer model effectively extracts lesion area features from pneumonia X-ray images while remaining lightweight.
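To give a concrete picture of the local and global feature fusion described above, the following is a minimal PyTorch sketch of such a block. It is not the paper's implementation: the class name LGFFMSketch, the channel sizes, the use of a depthwise separable convolution for the local branch, the Transformer encoder depth, and the concatenation plus 1x1 convolution fusion are all illustrative assumptions.

```python
# Minimal, illustrative sketch of a local-global feature fusion block,
# loosely following the LGFFM idea described in the abstract.
# NOT the authors' implementation; layer sizes, module name, and the
# fusion strategy (concatenation + 1x1 convolution) are assumptions.

import torch
import torch.nn as nn


class LGFFMSketch(nn.Module):
    """Fuses local features from a convolutional branch with global
    features from a lightweight Transformer encoder branch."""

    def __init__(self, channels: int = 64, num_heads: int = 4, depth: int = 2):
        super().__init__()
        # Local branch: depthwise + pointwise convolution captures fine detail.
        self.local_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # Global branch: Transformer encoder over flattened spatial tokens.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, dim_feedforward=2 * channels,
            batch_first=True)
        self.global_branch = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Fusion: concatenate the two branches and project back with a 1x1 conv.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local_feat = self.local_branch(x)                 # (B, C, H, W)
        tokens = x.flatten(2).transpose(1, 2)             # (B, H*W, C)
        global_feat = self.global_branch(tokens)          # (B, H*W, C)
        global_feat = global_feat.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local_feat, global_feat], dim=1))


if __name__ == "__main__":
    # Quick shape check on a dummy feature map.
    x = torch.randn(1, 64, 16, 16)
    print(LGFFMSketch(64)(x).shape)  # torch.Size([1, 64, 16, 16])
```

The two-branch design mirrors the abstract's motivation: the convolutional branch preserves fine local detail in the lesion region, while the Transformer branch models long-range dependencies across the whole feature map, and the 1x1 projection keeps the fused output at the original channel count so the block can be dropped into a backbone without changing downstream shapes.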