Videos exhibit diverse and complex spatio-temporal features, and actions vary widely in speed and scale, so existing action recognition methods often capture local motion details insufficiently and mine long-range temporal dependencies inadequately. To address this, a video action recognition network based on complementary modeling of local and long-range temporal information is proposed. The network is composed of a Two-level Fusion Motion Excitation (TFME) module and a Temporal Aggregation Channel Excitation (TACE) module. In the TFME module, the first-order and second-order differences between adjacent feature maps are computed and fused, and the fused weights are used to excite the channels of the original feature maps; this strengthens the fine-grained extraction of multi-level motion features and thereby models local temporal information. In the TACE module, a hierarchical residual pyramid structure is built with a channel grouping strategy, which expands the temporal receptive field and improves the learning of multi-scale features. In addition, a Temporal Channel Attention (TCA) mechanism is designed to adjust the aggregated feature maps dynamically and optimize the weight allocation among temporal channels, thereby modeling long-range temporal information. Finally, the two complementary modules are integrated and embedded into a 2D residual network for end-to-end action recognition. Experimental results on the Something-Something V1 and V2 validation sets show that, using only RGB frames with a random 8-frame sampling strategy, the proposed network achieves Top-1 accuracies of 50.6% and 61.9%, respectively; with a 16-frame sampling strategy, the accuracies are 54.1% and 65.6%, respectively.
These results indicate that the proposed network models both local motion details and long-range temporal dependencies efficiently, offering a new approach to action recognition in complex temporal scenarios.
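To make the TFME idea concrete, the following is a minimal NumPy sketch of two-level motion excitation, not the implementation evaluated above. Several choices are assumptions made only for illustration: the function names `pad_time` and `two_level_motion_excitation`, the `(T, C, H, W)` feature layout, zero-padding of the differences along time, fusion by simple addition, global spatial average pooling, a sigmoid gate, and a residual connection. The actual module presumably also contains learnable convolutions, which are omitted here.

```python
import numpy as np


def pad_time(a, num_frames):
    """Zero-pad `a` along axis 0 (time) so that it has `num_frames` steps."""
    missing = num_frames - a.shape[0]
    pad = np.zeros((missing,) + a.shape[1:], dtype=a.dtype)
    return np.concatenate([a, pad], axis=0)


def two_level_motion_excitation(x):
    """Excite channels of x with fused first- and second-order differences.

    x: array of shape (T, C, H, W), one feature map per sampled frame.
    Returns an array of the same shape.
    """
    num_frames = x.shape[0]

    # First-order difference between adjacent feature maps: (T-1, C, H, W).
    d1 = x[1:] - x[:-1]
    # Second-order difference, i.e. difference of differences: (T-2, C, H, W).
    d2 = d1[1:] - d1[:-1]

    # Assumption: pad both with zeros so they align with the T input frames.
    d1 = pad_time(d1, num_frames)
    d2 = pad_time(d2, num_frames)

    # Assumption: global spatial average pooling gives one value per channel.
    p1 = d1.mean(axis=(2, 3))  # (T, C)
    p2 = d2.mean(axis=(2, 3))  # (T, C)

    # Assumption: fuse the two levels by addition, then gate with a sigmoid.
    weights = 1.0 / (1.0 + np.exp(-(p1 + p2)))  # (T, C), values in (0, 1)

    # Assumption: residual channel excitation of the original feature maps.
    return x + x * weights[:, :, None, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.standard_normal((8, 16, 7, 7))  # 8 frames, 16 channels
    excited = two_level_motion_excitation(features)
    print(excited.shape)
```

In this sketch a clip with no motion (identical feature maps at every time step) yields zero differences, so every gate equals 0.5 and the output is a uniform rescaling of the input; only channels whose content changes over time receive a different weight.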