To address the inaccurate action classification and localization caused by existing weakly supervised action localization methods processing video clips independently as isolated action instances, a weakly supervised action localization method integrating temporal and global contextual feature enhancement was proposed. Firstly, a temporal feature enhancement branch was constructed, which enlarged the receptive field with dilated convolution and introduced an attention mechanism to capture temporal dependencies among video clips. Secondly, an Expectation-Maximization (EM) algorithm based on the Gaussian Mixture Model (GMM) was designed to capture video context information, and global contextual feature enhancement was performed via binary walk propagation; as a result, high-quality Temporal Class Activation Maps (TCAMs) were generated as pseudo labels to supervise the temporal enhancement branch online. Thirdly, a momentum-updated network was used to build a cross-video dictionary reflecting action features across videos. Finally, cross-video contrastive learning was applied to improve the accuracy of action classification. Experimental results show that the proposed method achieves mean Average Precision (mAP) of 42.0% and 42.2% on the THUMOS'14 and ActivityNet v1.3 datasets at an Intersection-over-Union (IoU) threshold of 0.5, improvements of 2.6 and 0.6 percentage points respectively over CCKEE (Cross-video Contextual Knowledge Exploration and Exploitation), verifying the effectiveness of the proposed method.
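The temporal feature enhancement branch described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the layer widths, kernel sizes, and dilation schedule are assumptions chosen for illustration. It shows the two ingredients the abstract names: dilated 1D convolution over the clip axis (which enlarges the receptive field without extra parameters) followed by a residual dot-product self-attention step that lets each clip aggregate context from all other clips.

```python
import numpy as np

rng = np.random.default_rng(0)

def dilated_conv1d(x, w, dilation):
    """'Same'-padded 1D dilated convolution along the temporal axis.

    x: (T, C) clip features; w: (K, C, C) kernel; returns (T, C).
    With kernel size K and dilation d, the receptive field of one
    layer is (K - 1) * d + 1 clips, so stacking layers with growing
    dilation widens temporal context at the same parameter cost.
    (Shapes and sizes here are illustrative assumptions.)
    """
    T, C = x.shape
    K = w.shape[0]
    pad = (K - 1) * dilation // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((T, C))
    for t in range(T):
        for k in range(K):
            out[t] += xp[t + k * dilation] @ w[k]
    return out

def self_attention(x):
    """Single-head dot-product attention across clips (temporal axis)."""
    scores = x @ x.T / np.sqrt(x.shape[1])       # (T, T) clip-to-clip affinity
    scores -= scores.max(axis=1, keepdims=True)  # softmax numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ x                              # context-aggregated features

T, C, K = 20, 8, 3        # 20 clips, 8-dim features, kernel size 3 (assumed)
x = rng.standard_normal((T, C))
# two dilated layers: receptive field grows from 3 to 3 + (3 - 1) * 2 = 7 clips
h = dilated_conv1d(x, rng.standard_normal((K, C, C)) * 0.1, dilation=1)
h = dilated_conv1d(h, rng.standard_normal((K, C, C)) * 0.1, dilation=2)
enhanced = h + self_attention(h)  # residual temporal enhancement
print(enhanced.shape)             # (20, 8)
```

The residual connection keeps each clip's own feature while mixing in attention-weighted context from the rest of the video, which is the mechanism the abstract credits with capturing temporal dependencies between clips.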