Real-world networks are often composed of multiple types of entities and interaction relationships, with topological structure and attributes that evolve continuously over time. The heterogeneity and dynamics inherent in such networks can be described by a Dynamic Heterogeneous Graph (DHG). To address the coarse fusion of spatio-temporal information and the heavy reliance of the supervised learning paradigm on manual labels in existing DHG representation learning models, a Masked AutoEncoder (MAE) enhanced DHG representation learning model was proposed. Firstly, heterogeneous spatial information was fused within each snapshot through a multi-level attention structure, and temporal information was fused across snapshots. Then, the node representations were enriched by using the reconstruction loss of the masked autoencoder as a self-supervised training signal. Experimental results show that, on link prediction tasks over multiple real-world datasets, the proposed model improves the Area Under the receiver operating characteristic Curve (AUC) by at least 1.26 to 3.99 percentage points compared with the baseline models. These results indicate that the proposed model provides an effective self-supervised framework for DHG representation learning, facilitating more precise capture of heterogeneous information and dynamic evolution patterns in real networks.
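The masked-reconstruction objective mentioned above can be illustrated with a minimal sketch. This is not the proposed model: the abstract does not specify the masking scheme or the form of the loss, so the zero-vector mask token, the mean squared error, and the helper names `mask_nodes`, `reconstruction_loss`, and `identity_autoencoder` are all assumptions made for illustration. The attention-based encoder is replaced by a placeholder that returns its input unchanged.

```python
import random


def mask_nodes(features, mask_ratio, rng):
    """Replace the features of a random subset of nodes with zeros.

    Assumption: a zero vector serves as the mask token.
    """
    num_nodes = len(features)
    num_masked = max(1, int(round(mask_ratio * num_nodes)))
    masked_ids = sorted(rng.sample(range(num_nodes), num_masked))
    masked_set = set(masked_ids)
    corrupted = [
        [0.0] * len(row) if i in masked_set else list(row)
        for i, row in enumerate(features)
    ]
    return corrupted, masked_ids


def reconstruction_loss(original, reconstructed, masked_ids):
    """Mean squared error computed over the masked nodes only.

    Assumption: MSE is used here; the actual loss may differ.
    """
    total, count = 0.0, 0
    for i in masked_ids:
        for x, y in zip(original[i], reconstructed[i]):
            total += (x - y) ** 2
            count += 1
    return total / count


def identity_autoencoder(features):
    """Hypothetical stand-in for the encoder and decoder (returns a copy)."""
    return [list(row) for row in features]


# Toy node features for one snapshot: 4 nodes, 2 attributes each.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
rng = random.Random(0)

corrupted, masked_ids = mask_nodes(features, mask_ratio=0.5, rng=rng)
reconstructed = identity_autoencoder(corrupted)
loss = reconstruction_loss(features, reconstructed, masked_ids)
print("masked nodes:", masked_ids, "loss:", loss)
```

Because the loss is computed only on the masked nodes and needs no labels, it can act as a self-supervised signal; in a trained model the placeholder would be replaced by the learned encoder and decoder, and the loss would be minimized by gradient descent.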