Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (5): 1379-1386. DOI: 10.11772/j.issn.1001-9081.2024060802

• The 10th China Conference on Data Mining •


Semi-supervised video object segmentation method based on spatio-temporal decoupling and regional robustness enhancement

Pengyu CHEN1, Xiushan NIE1,2, Nanjun LI2,3, Tuo LI2,3   

  1. School of Computer Science and Technology, Shandong Jianzhu University, Jinan, Shandong 250101, China
    2. Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Company Limited, Jinan, Shandong 250013, China
    3. State Key Laboratory of High-end Server & Storage Technology (Inspur Group Company Limited), Jinan, Shandong 250013, China
  • Received: 2024-06-21  Revised: 2024-07-19  Accepted: 2024-07-23  Online: 2024-08-19  Published: 2025-05-10
  • Contact: Xiushan NIE
  • About author: CHEN Pengyu, born in 2000 in Dezhou, Shandong, M. S. candidate, CCF member. His research interests include computer vision.
    NIE Xiushan, born in 1981 in Xuzhou, Jiangsu, Ph. D., professor, CCF distinguished member. His research interests include artificial intelligence and intelligent media analysis.
    LI Nanjun, born in 1994 in Laiwu, Shandong, Ph. D., engineer, CCF member. His research interests include intelligent video analysis and artificial intelligence chip design.
    LI Tuo, born in 1986 in Yiyang, Hunan, M. S., senior engineer. His research interests include computer system architecture and chip design.
  • Supported by:
    National Natural Science Foundation of China (62176141); Shandong Provincial Natural Science Foundation for Distinguished Young Scholars (ZR2021JQ26)


Abstract:

In response to the issues faced by memory-based methods in semi-supervised Video Object Segmentation (VOS), such as object occlusion caused by inter-object interactions and interference from similar objects or noise in the background, a semi-supervised VOS method based on spatio-temporal decoupling and regional robustness enhancement was proposed. Firstly, a structured Transformer architecture was constructed to remove the feature information shared by all pixels, emphasizing the differences among pixels and thoroughly mining the key features of objects in video frames. Secondly, the similarity between the current frame and the long-term memory frames was decoupled into two critical dimensions, spatio-temporal correlation and object importance, which allowed a more precise analysis of pixel-level spatio-temporal features and object features and thereby addressed the object occlusion caused by inter-object interactions. Finally, a Regional Strip Attention (RSA) module was designed to enhance the focus on the foreground region and suppress background noise by utilizing the object location information stored in the long-term memory. Experimental results show that the proposed method outperforms the retrained AOT (Associating Objects with Transformers) model by 1.7 percentage points in J&F on the DAVIS 2017 validation set and by 1.6 percentage points in overall score on the YouTube-VOS 2019 validation set, indicating that the proposed method effectively addresses the above issues in semi-supervised VOS.
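To make the two ideas sketched in the abstract more concrete, the following is a minimal PyTorch-style illustration of (a) decoupling the query-memory similarity into a pixel-level spatio-temporal term and an object-importance term, and (b) a strip-shaped attention gate driven by a foreground prior. The paper's actual formulation, tensor shapes, module names, and the additive way the two similarity terms are combined here are all illustrative assumptions made for this sketch, not the authors' implementation.

```python
# Illustrative sketch only: shapes, names, and the way the decoupled terms are
# combined are assumptions; the paper's exact modules are not given in the abstract.
import torch
import torch.nn.functional as F


def decoupled_similarity(query, mem_keys, obj_logits):
    """Decouple query-memory matching into two terms (assumed additive).

    query:      (B, C, H*W)    current-frame pixel features
    mem_keys:   (B, C, T*H*W)  long-term memory pixel features
    obj_logits: (B, T*H*W)     per-memory-pixel object-importance scores
    Returns an attention map of shape (B, H*W, T*H*W).
    """
    # 1) pixel-level spatio-temporal correlation: plain scaled dot-product affinity
    spatio_temporal = torch.einsum('bcq,bck->bqk', query, mem_keys)
    spatio_temporal = spatio_temporal / query.shape[1] ** 0.5

    # 2) object importance: how strongly each memory pixel belongs to a target
    importance = obj_logits.unsqueeze(1)            # (B, 1, T*H*W), broadcast over queries

    # combine the two decoupled terms before normalisation
    return F.softmax(spatio_temporal + importance, dim=-1)


def regional_strip_attention(feat, fg_mask):
    """Assumed strip-attention variant: pool the foreground region along
    horizontal and vertical strips and use the result to re-weight the features.

    feat:    (B, C, H, W)  decoded frame features
    fg_mask: (B, 1, H, W)  coarse foreground prior propagated from memory
    """
    fg_feat = feat * fg_mask                        # suppress background responses
    row_ctx = fg_feat.mean(dim=3, keepdim=True)     # (B, C, H, 1) horizontal strips
    col_ctx = fg_feat.mean(dim=2, keepdim=True)     # (B, C, 1, W) vertical strips
    gate = torch.sigmoid(row_ctx + col_ctx)         # strip-wise attention weights
    return feat * gate + feat                       # residual re-weighting


if __name__ == '__main__':
    B, C, T, H, W = 1, 64, 2, 30, 30
    attn = decoupled_similarity(torch.randn(B, C, H * W),
                                torch.randn(B, C, T * H * W),
                                torch.randn(B, T * H * W))
    out = regional_strip_attention(torch.randn(B, C, H, W),
                                   torch.rand(B, 1, H, W))
    print(attn.shape, out.shape)  # torch.Size([1, 900, 1800]) torch.Size([1, 64, 30, 30])
```

In this sketch the foreground prior plays the role the abstract assigns to the object location information from long-term memory; how that prior is actually obtained and injected is a design detail of the paper and is not reproduced here.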

Key words: Video Object Segmentation (VOS), spatio-temporal decoupling, semi-supervised learning, Transformer, strip attention

CLC Number: