Existing methods struggle to accurately predict the structural response of buildings to dynamic loads such as earthquakes, mainly because they cannot effectively learn the cyclic variation of seismic waves and because their feature fusion is insufficient. To address these challenges, a deep learning model for structural response prediction based on a frequency-domain attention mechanism is proposed. The model combines a frequency-domain augmented attention mechanism with Gated Recurrent Units (GRUs): it exploits the sparsity of seismic wave time-series data in the frequency domain to mine feature information deeply, while retaining the efficiency of GRUs in time-series tasks, thereby enabling efficient encoding of latent seismic wave features. Furthermore, a pyramid network structure with weight stacking is introduced to ease the training of deep networks by providing shortcut connections across layers. Additionally, an autoregressive prediction framework is proposed that uses historical structural responses as auxiliary features, enriching the feature space and improving the prediction accuracy of the network. Experimental results from three case studies demonstrate that the proposed model outperforms existing approaches, such as the Residual Long Short-Term Memory (ResLSTM) network and the Physics-informed LSTM (PhyLSTM) network.
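
The frequency-domain sparsity that the abstract refers to can be illustrated with a minimal sketch. This is not the proposed attention mechanism; it only shows, under stated assumptions, that a quasi-periodic signal may be represented well by a small number of Fourier modes. The function name `keep_top_k_modes`, the synthetic signal, and the choice `k=2` are hypothetical and used only for illustration.

```python
import numpy as np


def keep_top_k_modes(signal, k):
    """Zero all but the k largest-magnitude rFFT coefficients, then invert."""
    spectrum = np.fft.rfft(signal)
    # Indices of the k coefficients with the largest magnitude
    keep = np.argsort(np.abs(spectrum))[-k:]
    sparse = np.zeros_like(spectrum)
    sparse[keep] = spectrum[keep]
    # Back to the time domain, with the original length
    return np.fft.irfft(sparse, n=len(signal))


# Synthetic stand-in for a ground-motion record (assumption: not real seismic data)
t = np.arange(1024) / 1024.0
wave = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

approx = keep_top_k_modes(wave, k=2)
rel_error = np.linalg.norm(wave - approx) / np.linalg.norm(wave)
```

For this idealized two-tone signal, two modes are enough to recover it almost exactly. Real seismic records are broadband and noisy, so more modes would be needed and the reconstruction would only be approximate.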