Journal of Computer Applications


Video Semantic Segmentation Algorithm with Inter-frame Temporal Fusion and Motion Compensation

  

  • Received:2025-02-25 Revised:2025-04-03 Online:2025-04-09 Published:2025-04-09
  • Contact: miao **cheng


WEI Shaoyun 1, CHENG Miao 2

  1. Chengdu Institute of Computer Applications, University of Chinese Academy of Sciences
    2. Shenzhen Zhongchao Kexin Financial Technology Co., Ltd.
  • Corresponding author: CHENG Miao

Abstract: To address the challenges that current video semantic segmentation techniques face in complex outdoor dynamic scenes, such as the lack of temporal multi-frame consistency and feature misalignment between adjacent frames, we propose a novel video semantic segmentation framework called STDA-Net (Spatiotemporal and Temporal Dynamics Alignment Network). In dynamic video analysis, the absence of temporal feature consistency leads to discontinuous segmentation results across adjacent frames, causing abrupt changes that compromise the semantic stability of dynamic scenes. Meanwhile, features from multiple frames are difficult to align precisely under occlusions, lighting variations, and other disturbances; this misalignment introduces noise during feature fusion and thereby reduces segmentation accuracy. To mitigate these issues, STDA-Net incorporates two key modules: the Multi-Frame Dynamic Feature Aggregation Module (MFDA) and the Temporal Affine Motion Enhancement Module (TACFM). The MFDA module enhances inter-frame feature consistency through multi-scale feature extraction, a channel self-attention mechanism, and a multi-frame multi-scale wavelet fusion strategy, effectively reducing abrupt changes in segmentation results and ensuring stability in dynamic scenes. The TACFM module applies affine motion compensation to precisely align non-keyframe features and uses motion displacement information to strengthen edge structure perception, improving segmentation accuracy in rapidly changing environments. Extensive experiments on the VSPW dataset and a rail transit power grid surveillance video dataset demonstrate that STDA-Net significantly improves segmentation accuracy and temporal consistency. On VSPW, STDA-Net achieves a VmIoU of 40.5% and an mVC of 86.4%, outperforming the baseline TCBst-ppm by 4.0 and 1.1 percentage points and CFFM by 1.3 and 1.2 percentage points, respectively. On the rail transit dataset, STDA-Net attains a VmIoU of 67.5% and an mVC of 91.7%, improvements of 4.1 and 6.2 percentage points over TCBst-ppm and 2.3 and 6.4 percentage points over CFFM. These results validate the effectiveness of STDA-Net in complex dynamic environments, with significant gains in both segmentation accuracy and consistency, and demonstrate the adaptability and robustness that make it well suited to rail transit and other challenging dynamic scenarios.
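The affine motion compensation idea behind TACFM — warping non-keyframe features toward the keyframe with a 2×3 affine matrix — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (the abstract does not specify one); the function name `affine_warp`, nearest-neighbour sampling, and zero-filling of out-of-bounds samples are all assumptions for illustration.

```python
import numpy as np

def affine_warp(feat: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Warp a (C, H, W) feature map with a 2x3 affine matrix A.

    Each output pixel (x, y) samples the input at A @ [x, y, 1]
    (nearest neighbour, an assumption); out-of-bounds samples are zero.
    """
    C, H, W = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]                               # output grid
    ones = np.ones_like(xs)
    coords = np.stack([xs, ys, ones], axis=0).reshape(3, -1)  # (3, H*W)
    src = A @ coords                                          # source x, y
    sx = np.rint(src[0]).astype(int)
    sy = np.rint(src[1]).astype(int)
    valid = (sx >= 0) & (sx < W) & (sy >= 0) & (sy < H)
    out = np.zeros_like(feat)
    out_flat = out.reshape(C, -1)                             # view into out
    out_flat[:, valid] = feat[:, sy[valid], sx[valid]]
    return out

F = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)

# identity transform leaves the features unchanged
I = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

# a pure translation by +1 pixel in x shifts all features one column left
T = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
W_ = affine_warp(F, T)
```

In a video model the matrix would come from estimated inter-frame motion rather than being hand-specified, and sampling would typically be differentiable (bilinear) so the alignment can be trained end to end.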

Key words: video semantic segmentation, multi-scale feature fusion, temporal consistency, motion compensation, feature alignment
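The channel self-attention mechanism named in the MFDA description can be sketched in squeeze-and-excitation style: pool each channel to a descriptor, pass it through a small gate, and rescale the channels. The abstract does not give MFDA's internals, so the function name, the two-layer ReLU/sigmoid gate, and the weight shapes below are assumptions, not the paper's design.

```python
import numpy as np

def channel_attention(feat: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    """SE-style channel self-attention on a (C, H, W) feature map.

    Global average pooling -> two-layer gate -> sigmoid -> rescale channels.
    W1: (C//r, C) and W2: (C, C//r) are hypothetical gate weights.
    """
    C = feat.shape[0]
    z = feat.reshape(C, -1).mean(axis=1)     # (C,) per-channel descriptor
    h = np.maximum(W1 @ z, 0.0)              # ReLU bottleneck
    s = 1.0 / (1.0 + np.exp(-(W2 @ h)))      # per-channel gate in (0, 1)
    return feat * s[:, None, None]

C, r = 4, 2
rng = np.random.default_rng(0)
feat = rng.standard_normal((C, 3, 3))
W1 = rng.standard_normal((C // r, C))
W2 = rng.standard_normal((C, C // r))
out = channel_attention(feat, W1, W2)
```

Because the gate lies strictly in (0, 1), the mechanism can only down-weight channels; in a trained network this lets the model suppress channels that are unreliable across frames.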

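The multi-frame wavelet fusion strategy mentioned for MFDA can likewise be sketched with a one-level Haar transform: average the low-frequency bands across frames for stability and keep the maximum-energy detail coefficients for sharp edges. The fusion rule and the use of plain Haar filters are assumptions for illustration, not the paper's method.

```python
import numpy as np

def haar_dwt2(x: np.ndarray):
    """One-level 2-D Haar transform of an (H, W) map (H, W even)."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0      # vertical average
    d = (x[0::2, :] - x[1::2, :]) / 2.0      # vertical detail
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0     # low-low band
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh) -> np.ndarray:
    """Exact inverse of haar_dwt2."""
    H2, W2 = ll.shape
    a = np.zeros((H2, 2 * W2)); d = np.zeros((H2, 2 * W2))
    a[:, 0::2] = ll + lh; a[:, 1::2] = ll - lh
    d[:, 0::2] = hl + hh; d[:, 1::2] = hl - hh
    x = np.zeros((2 * H2, 2 * W2))
    x[0::2, :] = a + d; x[1::2, :] = a - d
    return x

def wavelet_fuse(frames):
    """Fuse frames: average low-frequency bands, keep max-energy details."""
    bands = [haar_dwt2(f) for f in frames]
    ll = np.mean([b[0] for b in bands], axis=0)
    fused_details = []
    for i in (1, 2, 3):
        stack = np.stack([b[i] for b in bands])          # (T, H/2, W/2)
        idx = np.argmax(np.abs(stack), axis=0)           # frame with max energy
        fused_details.append(np.take_along_axis(stack, idx[None], axis=0)[0])
    return haar_idwt2(ll, *fused_details)

x = np.random.default_rng(0).standard_normal((8, 8))
```

Averaging the low band smooths slow appearance changes across frames, while the max-energy rule on the detail bands preserves whichever frame saw an edge most clearly — one plausible way to trade temporal stability against edge sharpness.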

