To improve the alignment accuracy and fusion efficiency between features of different modalities in multimodal event extraction, and to enhance the model's understanding of the semantic relationships between images and text, a multimodal event extraction model based on a dual-channel "text-image" gated feature fusion mechanism, named MEE-DF (Multimodal Event Extraction based on Dual-channel Fusion), was proposed. Firstly, a channel for generating text descriptions from images was added, so that event arguments implicitly contained in the images were mined and the information representation for event extraction was enriched. Secondly, a Locality Constrained Cross Attention (LCCA) mechanism was constructed, in which geometric alignment graphs were generated to embed image information and highly discriminative image features were extracted. Thirdly, an adversarial gating mechanism based on interactive attention maps was built to achieve fine-grained alignment between text entities and image objects. Finally, a dual-channel feature fusion strategy was used to select important patch features, remove redundant information, and improve feature integration efficiency. Experimental results on the public MEED and M2E2 datasets show that MEE-DF achieves F1 scores of 90.9% and 88.8%, respectively, on the event type detection task, and F1 scores of 73.3% and 68.1%, respectively, on the Event Argument Extraction (EAE) task, indicating that MEE-DF outperforms existing event extraction models. Ablation experiments further demonstrate that each module of the proposed model contributes significantly to the improvement of event extraction performance.
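To make the gated fusion idea concrete, the following is a minimal PyTorch sketch of a dual-channel "text-image" gated fusion layer in the spirit of the abstract: a text-conditioned attention over image patches stands in for LCCA (the geometric alignment graphs are omitted), and a sigmoid gate mixes the visual channel with the generated-caption channel. All names (`GatedDualChannelFusion`, `d_model`, the toy shapes) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedDualChannelFusion(nn.Module):
    """Fuse text features with two image-derived channels via a learned gate.

    Channel 1: visual patch features (e.g., from a ViT encoder).
    Channel 2: features of a text description generated from the image.
    A sigmoid gate decides, per dimension, how much of each channel to keep.
    """
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)  # gate from both channels
        self.proj = nn.Linear(d_model, d_model)      # post-fusion projection

    def forward(self, text_feat, patch_feat, caption_feat):
        # Text-conditioned attention over image patches: a simplified
        # stand-in for the paper's LCCA, which additionally uses
        # geometric alignment graphs to constrain the attention.
        attn = torch.softmax(text_feat @ patch_feat.transpose(-2, -1), dim=-1)
        visual_ctx = attn @ patch_feat  # text-aligned visual context

        # Gate between the visual channel and the generated-caption channel.
        g = torch.sigmoid(self.gate(torch.cat([visual_ctx, caption_feat], dim=-1)))
        fused = g * visual_ctx + (1.0 - g) * caption_feat
        return self.proj(fused + text_feat)  # residual with text features

# Toy usage: a batch of 2 sentences (10 tokens each) and 49 image patches.
fusion = GatedDualChannelFusion(d_model=768)
text = torch.randn(2, 10, 768)
patches = torch.randn(2, 49, 768)
caption = torch.randn(2, 10, 768)
out = fusion(text, patches, caption)
print(out.shape)  # torch.Size([2, 10, 768])
```

The per-dimension gate lets the model suppress redundant patch information when the generated caption already covers it, which matches the stated goal of filtering important patch features before fusion.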