Journal of Computer Applications


Video Semantic Segmentation Algorithm with Inter-frame Temporal Fusion and Motion Compensation

  

  • Received:2025-02-25 Revised:2025-04-03 Online:2025-04-09 Published:2025-04-09
  • Contact: miao **cheng


WEI Shaoyun 1, CHENG Miao 2

  1. Chengdu Institute of Computer Applications, University of Chinese Academy of Sciences
    2. Shenzhen Zhongchao Kexin Financial Technology Co., Ltd.
  • Corresponding author: CHENG Miao

Abstract: To address the challenges that current video semantic segmentation techniques face in complex outdoor dynamic scenes, such as the lack of temporal multi-frame consistency and feature misalignment between adjacent frames, we propose a novel video semantic segmentation framework called STDA-Net (Spatiotemporal and Temporal Dynamics Alignment Network). In dynamic video analysis, the absence of temporal feature consistency leads to discontinuous segmentation results across adjacent frames, causing abrupt changes that compromise the semantic stability of dynamic scenes. Meanwhile, features from multiple frames are difficult to align precisely under occlusions, lighting variations, and other disturbances; this misalignment introduces noise during feature fusion and thereby reduces segmentation accuracy. To mitigate these issues, STDA-Net incorporates two key modules: the Multi-Frame Dynamic Feature Aggregation Module (MFDA) and the Temporal Affine Motion Enhancement Module (TACFM). The MFDA module enhances inter-frame feature consistency through multi-scale feature extraction, a channel self-attention mechanism, and a multi-frame multi-scale wavelet fusion strategy, effectively reducing abrupt changes in segmentation results and ensuring stability in dynamic scenes. The TACFM module applies affine motion compensation to precisely align non-keyframe features and uses motion displacement information to strengthen edge structure perception, improving segmentation accuracy in rapidly changing environments. Extensive experiments on the VSPW dataset and a rail transit power grid surveillance video dataset demonstrate that STDA-Net significantly improves segmentation accuracy and temporal consistency. On VSPW, STDA-Net achieves a VmIoU of 40.5% and an mVC of 86.4%, outperforming the baseline TCBst-ppm by 4.0 and 1.1 percentage points and CFFM by 1.3 and 1.2 percentage points, respectively. On the rail transit dataset, STDA-Net attains a VmIoU of 67.5% and an mVC of 91.7%, improvements of 4.1 and 6.2 percentage points over TCBst-ppm and 2.3 and 6.4 percentage points over CFFM. These results validate the effectiveness of STDA-Net in complex dynamic environments, with significant gains in both segmentation accuracy and consistency, and demonstrate the adaptability and robustness that make it well suited to rail transit and other challenging dynamic scenarios.
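The affine motion compensation idea behind TACFM — warping non-keyframe features toward the keyframe with a 2×3 affine matrix — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (the abstract does not specify one); the function name `affine_warp`, nearest-neighbour sampling, and zero-filling of out-of-bounds samples are all assumptions for illustration.

```python
import numpy as np

def affine_warp(feat: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Warp a (C, H, W) feature map with a 2x3 affine matrix A.

    Each output pixel (x, y) samples the input at A @ [x, y, 1]
    (nearest neighbour, an assumption); out-of-bounds samples are zero.
    """
    C, H, W = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]                               # output grid
    ones = np.ones_like(xs)
    coords = np.stack([xs, ys, ones], axis=0).reshape(3, -1)  # (3, H*W)
    src = A @ coords                                          # source x, y
    sx = np.rint(src[0]).astype(int)
    sy = np.rint(src[1]).astype(int)
    valid = (sx >= 0) & (sx < W) & (sy >= 0) & (sy < H)
    out = np.zeros_like(feat)
    out_flat = out.reshape(C, -1)                             # view into out
    out_flat[:, valid] = feat[:, sy[valid], sx[valid]]
    return out

F = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)

# identity transform leaves the features unchanged
I = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

# a pure translation by +1 pixel in x shifts all features one column left
T = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
W_ = affine_warp(F, T)
```

In a video model the matrix would come from estimated inter-frame motion rather than being hand-specified, and sampling would typically be differentiable (bilinear) so the alignment can be trained end to end.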

Key words: video semantic segmentation, multi-scale feature fusion, temporal consistency, motion compensation, feature alignment
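The channel self-attention mechanism named in the MFDA description can be sketched in squeeze-and-excitation style: pool each channel to a descriptor, pass it through a small gate, and rescale the channels. The abstract does not give MFDA's internals, so the function name, the two-layer ReLU/sigmoid gate, and the weight shapes below are assumptions, not the paper's design.

```python
import numpy as np

def channel_attention(feat: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    """SE-style channel self-attention on a (C, H, W) feature map.

    Global average pooling -> two-layer gate -> sigmoid -> rescale channels.
    W1: (C//r, C) and W2: (C, C//r) are hypothetical gate weights.
    """
    C = feat.shape[0]
    z = feat.reshape(C, -1).mean(axis=1)     # (C,) per-channel descriptor
    h = np.maximum(W1 @ z, 0.0)              # ReLU bottleneck
    s = 1.0 / (1.0 + np.exp(-(W2 @ h)))      # per-channel gate in (0, 1)
    return feat * s[:, None, None]

C, r = 4, 2
rng = np.random.default_rng(0)
feat = rng.standard_normal((C, 3, 3))
W1 = rng.standard_normal((C // r, C))
W2 = rng.standard_normal((C, C // r))
out = channel_attention(feat, W1, W2)
```

Because the gate lies strictly in (0, 1), the mechanism can only down-weight channels; in a trained network this lets the model suppress channels that are unreliable across frames.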

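The multi-frame wavelet fusion strategy mentioned for MFDA can likewise be sketched with a one-level Haar transform: average the low-frequency bands across frames for stability and keep the maximum-energy detail coefficients for sharp edges. The fusion rule and the use of plain Haar filters are assumptions for illustration, not the paper's method.

```python
import numpy as np

def haar_dwt2(x: np.ndarray):
    """One-level 2-D Haar transform of an (H, W) map (H, W even)."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0      # vertical average
    d = (x[0::2, :] - x[1::2, :]) / 2.0      # vertical detail
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0     # low-low band
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh) -> np.ndarray:
    """Exact inverse of haar_dwt2."""
    H2, W2 = ll.shape
    a = np.zeros((H2, 2 * W2)); d = np.zeros((H2, 2 * W2))
    a[:, 0::2] = ll + lh; a[:, 1::2] = ll - lh
    d[:, 0::2] = hl + hh; d[:, 1::2] = hl - hh
    x = np.zeros((2 * H2, 2 * W2))
    x[0::2, :] = a + d; x[1::2, :] = a - d
    return x

def wavelet_fuse(frames):
    """Fuse frames: average low-frequency bands, keep max-energy details."""
    bands = [haar_dwt2(f) for f in frames]
    ll = np.mean([b[0] for b in bands], axis=0)
    fused_details = []
    for i in (1, 2, 3):
        stack = np.stack([b[i] for b in bands])          # (T, H/2, W/2)
        idx = np.argmax(np.abs(stack), axis=0)           # frame with max energy
        fused_details.append(np.take_along_axis(stack, idx[None], axis=0)[0])
    return haar_idwt2(ll, *fused_details)

x = np.random.default_rng(0).standard_normal((8, 8))
```

Averaging the low band smooths slow appearance changes across frames, while the max-energy rule on the detail bands preserves whichever frame saw an edge most clearly — one plausible way to trade temporal stability against edge sharpness.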

