Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (7): 2192-2199.DOI: 10.11772/j.issn.1001-9081.2023070926

• Multimedia computing and computer simulation •

Lightweight video salient object detection network based on spatiotemporal information

Song XU1, Wenbo ZHANG1, Yifan WANG2

  1.School of Information and Communication Engineering,Dalian University of Technology,Dalian Liaoning 116000,China
    2.School of Innovation and Entrepreneurship,Dalian University of Technology,Dalian Liaoning 116000,China
  • Received:2023-07-09 Revised:2023-10-11 Accepted:2023-10-13 Online:2023-10-26 Published:2024-07-10
  • Contact: Yifan WANG
  • About author:XU Song, born in 2000, M. S. candidate. His research interests include video salient object detection, weakly supervised salient object detection.
    ZHANG Wenbo, born in 1999, M. S. candidate. His research interests include weakly supervised salient object detection, continual semantic segmentation.
    First author contact:WANG Yifan, born in 1990, Ph. D., lecturer. Her research interests include image and video segmentation, weakly supervised learning, unsupervised learning.
  • Supported by:
    Fundamental Research Funds for Central Universities(DUT22LAB124)

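Lightweight video salient object detection network based on spatiotemporal information

XU Song1, ZHANG Wenbo1, WANG Yifan2

  1.School of Information and Communication Engineering, Dalian University of Technology, Dalian Liaoning 116000
    2.School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian Liaoning 116000
  • Corresponding author: WANG Yifan
  • About the authors: XU Song (born 2000), male, from Suzhou, Anhui, M. S. candidate. Main research interests: video salient object detection, weakly supervised salient object detection;
    ZHANG Wenbo (born 1999), male, from Panjin, Liaoning, M. S. candidate. Main research interests: weakly supervised salient object detection, continual semantic segmentation;
    First contact: WANG Yifan (born 1990), female, from Dalian, Liaoning, lecturer, Ph. D., CCF member. Main research interests: image and video segmentation, weakly supervised learning, unsupervised learning.
  • Supported by:
    Fundamental Research Funds for the Central Universities (DUT22LAB124)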

Abstract:

Existing Video Salient Object Detection (VSOD) networks face two problems: first, the high computational cost of capturing temporal information makes the networks difficult to deploy on edge devices; second, their limited generalization ability makes it hard to handle challenging scenes in videos, such as occlusion and motion blur. Therefore, a lightweight VSOD network based on dynamic filtering and contrastive learning was proposed. Firstly, foreground features were coarsely sampled from each frame and a similarity matrix was computed over the samples, which was then used to weight the samples and filter out noisy features. Secondly, the denoised foreground features were used to generate the parameters of a dynamic filter, which was convolved with the original feature maps to extract foreground objects. Finally, a contrastive learning module was designed to assist the network's learning during the training phase; this module introduced no additional computational overhead during the inference phase. Extensive experiments were conducted on three datasets: DAVIS, DAVSOD and VOS. Experimental results show that, compared with DCFNet (Dynamic Context-sensitive Filtering Network for video salient object detection), the proposed network achieves comparable performance in F-measure, S-measure and Mean Absolute Error (MAE), while the frame rate increases from 28 frame/s to 38 frame/s, an improvement of 35.7%. The network has only 15.6×10^6 parameters, which makes it better suited to deployment on the edge side in practical applications.
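The denoising and dynamic-filtering steps described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation, and every name in it (`sample_foreground`, `denoise`, `apply_dynamic_filter`, `run_demo`, the temperature `tau`) is hypothetical. It assumes cosine similarity for the similarity matrix, a softmax over each sample's mean similarity as the weighting, and it uses the weighted mean feature directly as a 1×1 kernel, whereas the actual network presumably generates the filter parameters with learned layers.

```python
import numpy as np


def sample_foreground(features, coarse_mask, k):
    """Return the k feature vectors with the highest coarse saliency scores."""
    c = features.shape[0]
    flat = features.reshape(c, -1).T              # (H*W, C)
    top = np.argsort(coarse_mask.reshape(-1))[-k:]
    return flat[top]                              # (k, C)


def denoise(samples, tau=0.1):
    """Weight each sample by how similar it is, on average, to all samples.

    Assumption: cosine similarity plus a softmax with temperature tau.
    Samples that disagree with the majority receive near-zero weight.
    """
    unit = samples / (np.linalg.norm(samples, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T                           # (N, N) similarity matrix
    scores = sim.mean(axis=1) / tau
    weights = np.exp(scores - scores.max())       # numerically stable softmax
    weights /= weights.sum()
    return weights @ samples, weights             # (C,) kernel, (N,) weights


def apply_dynamic_filter(features, kernel):
    """Use the generated vector as a 1x1 convolution kernel over (C, H, W)."""
    return np.einsum("c,chw->hw", kernel, features)


def run_demo(seed=0):
    """Two synthetic frames whose coarse masks each contain one false positive."""
    rng = np.random.default_rng(seed)
    c, h, w = 4, 8, 8
    fg = np.array([1.0, 0.0, 0.0, 0.0])           # toy foreground feature
    bg = np.array([0.0, 1.0, 0.0, 0.0])           # toy background feature
    gt = np.zeros((h, w), dtype=bool)
    gt[2:6, 2:6] = True                           # 16 true foreground pixels

    frames, samples = [], []
    for _ in range(2):
        feats = np.where(gt, fg[:, None, None], bg[:, None, None])
        feats = feats + 0.01 * rng.standard_normal((c, h, w))
        coarse = gt.astype(float)
        coarse[0, 0] = 1.0                        # background pixel sampled by mistake
        frames.append(feats)
        samples.append(sample_foreground(feats, coarse, k=17))

    kernel, weights = denoise(np.concatenate(samples))
    response = apply_dynamic_filter(frames[0], kernel)
    return response, weights, gt
```

In this toy setting the two mistakenly sampled background vectors receive negligible weight, so thresholding the filter response recovers the foreground region even though the coarse mask was noisy. Because the kernel is computed from the input itself, no frame-specific weights need to be stored, which is the general appeal of dynamic filtering for a lightweight network.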

Key words: video salient object detection, dynamic filter, attention mechanism, contrastive learning, deep learning

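Abstract:

Existing video salient object detection (VSOD) networks face two problems: first, capturing temporal information is computationally too expensive, which makes the networks hard to apply in practice on mobile devices; second, the networks generalize poorly and struggle with challenging scenes in videos such as occlusion and motion blur. Therefore, a lightweight video salient object detection network based on dynamic filters and contrastive learning is proposed. First, foreground feature points are coarsely sampled from every frame of a sequence of consecutive frames and a similarity matrix is computed; weighting by the similarity matrix filters out the noisy features. Second, the filtered foreground features are used to generate dynamic filter parameters, and a convolution is performed on the original feature maps to extract foreground objects. In addition, a contrastive learning module is designed for the training stage to help the network learn, and it introduces no extra computation at the inference stage. Extensive experiments on three datasets, DAVIS, DAVSOD and VOS, show that, compared with DCFNet (Dynamic Context-sensitive Filtering Network for video salient object detection), the proposed network performs comparably on three metrics, F-measure, S-measure and mean absolute error (MAE), while the frame rate rises from 28 frame/s to 38 frame/s, an increase of 35.7%; the network has only 15.6×10^6 parameters, which favors deployment on the edge side in practical applications.

Key words: video salient object detection, dynamic filter, attention mechanism, contrastive learning, deep learning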

CLC Number: