To address the model compatibility problem caused by missing modalities in complex real-world scenes, an emotion recognition method supporting input from any available modality was proposed. Firstly, a modality-random-dropout training strategy was adopted in both the pre-training and fine-tuning stages to ensure model compatibility during inference. Secondly, a spatio-temporal masking strategy and a feature fusion strategy based on a cross-modal attention mechanism were proposed to reduce the risk of overfitting and to enhance cross-modal feature fusion, respectively. Finally, to address the noisy-label problem caused by inconsistent emotion labels across modalities, an adaptive denoising strategy based on multi-prototype clustering was proposed: class centers were maintained for each modality, and noisy labels were removed by checking the consistency between the cluster assignment of each modality's features and the annotated label. Experimental results show that on a self-built dataset, compared with the baseline Audio-Visual Hidden Unit Bidirectional Encoder Representations from Transformers (AV-HuBERT), the proposed method improves the Weighted Average Recall (WAR) by 6.98 percentage points under aligned-modality inference, by 4.09 percentage points when the video modality is missing, and by 33.05 percentage points when the audio modality is missing; on the public video dataset DFEW, the proposed method also outperforms AV-HuBERT and achieves the highest WAR of 68.94%.
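The abstract only names the two core mechanisms, so the sketches below are illustrative rather than the authors' implementation. The first is a minimal PyTorch-style sketch of a modality-random-dropout step, assuming batch-first audio and video features of shape (B, T, D); the function and parameter names (random_modality_dropout, p_drop_audio, p_drop_video) and the rule of never dropping both modalities for the same sample are assumptions.

```python
import torch

def random_modality_dropout(audio_feat, video_feat, p_drop_audio=0.25, p_drop_video=0.25):
    """Hypothetical sketch: randomly zero out one modality per sample during
    training so the model stays usable when only audio or only video is
    available at inference time. Drop probabilities are assumed values."""
    batch = audio_feat.size(0)
    drop_a = torch.rand(batch, device=audio_feat.device) < p_drop_audio
    drop_v = torch.rand(batch, device=video_feat.device) < p_drop_video
    # Assumption: never drop both modalities for the same sample;
    # if both were selected, keep the video stream.
    both = drop_a & drop_v
    drop_v = drop_v & ~both
    audio_feat = audio_feat * (~drop_a).float().view(-1, 1, 1)
    video_feat = video_feat * (~drop_v).float().view(-1, 1, 1)
    return audio_feat, video_feat
```

The second sketch illustrates, under the same caveat, one plausible reading of the multi-prototype denoising step: each modality keeps one class center (prototype) per emotion class, each sample's modality feature is assigned to its nearest prototype, and samples whose assignments disagree with the annotated label are filtered out. The dictionary layout, cosine-similarity assignment, and the strict all-modality agreement rule are assumptions.

```python
import torch
import torch.nn.functional as F

def filter_noisy_labels(feats_by_modality, labels, prototypes_by_modality):
    """Hypothetical sketch of prototype-based label denoising.
    feats_by_modality: dict modality_name -> (B, D) sample features
    prototypes_by_modality: dict modality_name -> (C, D) class centers
    labels: (B,) annotated emotion labels
    Returns a boolean mask selecting samples considered clean."""
    keep = torch.ones_like(labels, dtype=torch.bool)
    for modality, feats in feats_by_modality.items():
        protos = prototypes_by_modality[modality]
        # Cosine similarity between each feature and each class prototype.
        sims = F.normalize(feats, dim=-1) @ F.normalize(protos, dim=-1).t()  # (B, C)
        cluster = sims.argmax(dim=-1)
        # Keep a sample only if its cluster matches the label in this modality too.
        keep &= cluster == labels
    return keep
```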