基于时空信息的轻量视频显著性目标检测网络

doi:10.11772/j.issn.1001-9081.2023070926

Journal of Computer Applications, 2024, Vol. 44, Issue (7): 2192-2199. DOI: 10.11772/j.issn.1001-9081.2023070926

• Multimedia computing and computer simulation •

徐松¹, 张文博¹, 王一帆²

¹ 大连理工大学信息与通信工程学院, 辽宁大连 116000
² 大连理工大学创新创业学院, 辽宁大连 116000

收稿日期:2023-07-09 修回日期:2023-10-11 接受日期:2023-10-13 发布日期:2023-10-26 出版日期:2024-07-10
通讯作者: 王一帆
作者简介:徐松（2000—），男，安徽宿州人，硕士研究生，主要研究方向：视频显著性目标检测、弱监督显著性目标检测；
张文博（1999—），男，辽宁盘锦人，硕士研究生，主要研究方向：弱监督显著性目标检测、持续语义分割；
第一联系人：王一帆（1990—），女，辽宁大连人，讲师，博士，CCF会员，主要研究方向：图像与视频分割、弱监督学习、无监督学习。
基金资助:
中央高校基本科研业务费专项基金资助项目(DUT22LAB124)

Lightweight video salient object detection network based on spatiotemporal information

Song XU¹, Wenbo ZHANG¹, Yifan WANG²

¹ School of Information and Communication Engineering, Dalian University of Technology, Dalian Liaoning 116000, China
² School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian Liaoning 116000, China

Received:2023-07-09 Revised:2023-10-11 Accepted:2023-10-13 Online:2023-10-26 Published:2024-07-10
Contact: Yifan WANG
About the authors: XU Song, born in 2000, M. S. candidate. His research interests include video salient object detection and weakly supervised salient object detection.
ZHANG Wenbo, born in 1999, M. S. candidate. His research interests include weakly supervised salient object detection and continual semantic segmentation.
First author contact: WANG Yifan, born in 1990, Ph. D., lecturer. Her research interests include image and video segmentation, weakly supervised learning and unsupervised learning.
Supported by:
Fundamental Research Funds for the Central Universities (DUT22LAB124)

摘要/Abstract

摘要：

现有视频显著性目标检测（VSOD）网络面临2个问题：一是在捕获时间信息时计算成本过大，导致网络难以在移动端实际应用；二是网络泛化能力较弱，难以处理视频中诸如遮挡、运动模糊等挑战性场景。因此，提出一种基于动态滤波器和对比学习思想的轻量视频显著性目标检测网络。首先，对连续帧的每帧图像进行粗略的前景特征点采样并进行相似度矩阵的计算，利用相似度矩阵进行加权从而滤除存在的噪声特征；其次，用滤波后的前景特征生成动态滤波器参数，对原始特征图执行卷积操作以提取前景物体；同时在训练阶段设计了一个对比学习模块帮助网络学习，在推理阶段并不会引入额外的计算量。在三个数据集DAVIS、DAVSOD和VOS上进行了广泛实验，实验结果表明，所提网络相较于DCFNet （Dynamic Context-sensitive Filtering Network for video salient object detection），在F-measure、S-measure以及平均绝对误差（MAE）3个指标上性能接近，帧率从28 frame/s提升到38 frame/s，提升了35.7%，同时网络参数量仅有15.6×10⁶，更有利于实际应用中在边缘侧进行部署。

关键词: 视频显著性目标检测, 动态滤波器, 注意力机制, 对比学习, 深度学习

Abstract:

Existing Video Salient Object Detection (VSOD) networks face two problems. First, capturing temporal information incurs a high computational cost, which makes the networks difficult to deploy on mobile and edge devices. Second, their limited generalization ability makes it hard to handle challenging scenes such as occlusion and motion blur. Therefore, a lightweight VSOD network based on dynamic filters and contrastive learning was proposed. Firstly, coarse foreground feature points were sampled from each of the consecutive frames and used to compute a similarity matrix, and the sampled features were weighted by this matrix to filter out noisy features. Secondly, the filtered foreground features were used to generate the parameters of a dynamic filter, which was convolved with the original feature maps to extract the foreground objects. Meanwhile, a contrastive learning module was designed to assist network learning during training; it introduces no additional computation at inference. Extensive experiments were conducted on three datasets: DAVIS, DAVSOD and VOS. Experimental results show that, compared with DCFNet (Dynamic Context-sensitive Filtering Network for video salient object detection), the proposed network achieves comparable performance in F-measure, S-measure and Mean Absolute Error (MAE), while the frame rate increases from 28 frame/s to 38 frame/s, a gain of 35.7%. The network has only 15.6×10⁶ parameters, which makes it better suited to edge-side deployment in practical applications.
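The two filtering steps in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the softmax weighting and the 1×1 kernel are all assumptions made for illustration, since this page does not give those details. Pure Python is used so the example stays self-contained.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def denoise_by_similarity(samples):
    """Pool sampled foreground features, down-weighting outliers.

    Assumption: each sample is weighted by a softmax over its mean cosine
    similarity to all samples, so features that disagree with the majority
    contribute less to the pooled vector.
    """
    unit = [normalize(s) for s in samples]
    sim = [[dot(u, w) for w in unit] for u in unit]  # K x K similarity matrix
    scores = [sum(row) / len(row) for row in sim]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(samples[0])
    pooled = [sum(w * s[d] for w, s in zip(weights, samples)) for d in range(dim)]
    return pooled, weights


def apply_dynamic_filter(feature_map, kernel):
    """Apply a 1x1 dynamic filter: each pixel's response is <kernel, feature>."""
    return [[dot(kernel, pixel) for pixel in row] for row in feature_map]


# Hypothetical toy data: two consistent foreground samples and one noisy one.
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
kernel, weights = denoise_by_similarity(samples)
response = apply_dynamic_filter([[[1.0, 0.0], [0.0, 1.0]]], kernel)
print(weights)
print(response)
```

In this toy case the noisy third sample receives the smallest weight, so the generated kernel responds more strongly to pixels that resemble the two consistent samples.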

Key words: video salient object detection, dynamic filter, attention mechanism, contrastive learning, deep learning

CLC number:

T391.4

徐松, 张文博, 王一帆. 基于时空信息的轻量视频显著性目标检测网络[J]. 计算机应用, 2024, 44(7): 2192-2199.

Song XU, Wenbo ZHANG, Yifan WANG. Lightweight video salient object detection network based on spatiotemporal information[J]. Journal of Computer Applications, 2024, 44(7): 2192-2199.

Figures/Tables: 12

References: 26

[1] TAN M, PANG R, LE Q V. EfficientDet: scalable and efficient object detection [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10778-10787.
[2] PAN Y, YAO T, LI H, et al. Video captioning with transferred semantic attributes [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6504-6512.
[3] JI W, YU S, WU J, et al. Learning calibrated medical image segmentation via multi-rater agreement modeling [C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 12336-12346.
[4] ITTI L. Automatic foveation for video compression using a neurobiological model of visual attention [J]. IEEE Transactions on Image Processing, 2004, 13(10): 1304-1318.
[5] HADIZADEH H, BAJIĆ I V. Saliency-aware video compression [J]. IEEE Transactions on Image Processing, 2014, 23(1): 19-33.
[6] WU H, LI G, LUO X. Weighted attentional blocks for probabilistic object tracking [J]. The Visual Computer, 2014, 30: 229-243.
[7] YAN P, LI G, XIE Y, et al. Semi-supervised video salient object detection using pseudo-labels [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 7283-7292.
[8] GU Y, WANG L, WANG Z, et al. Pyramid constrained self-attention network for fast video salient object detection [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 10869-10876.
[9] YANG Z, WANG Q, BERTINETTO L, et al. Anchor diffusion for unsupervised video object segmentation [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 931-940.
[10] YANG S, ZHANG L, QI J, et al. Learning motion-appearance co-attention for zero-shot video object segmentation [C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 1544-1553.
[11] WANG W, LU X, SHEN J, et al. Zero-shot video object segmentation via attentive graph neural networks [C]// Proceedings of the 2019 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2019: 9236-9245.
[12] YANG B, BENDER G, LE Q V, et al. CondConv: conditionally parameterized convolutions for efficient inference [C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 1307-1318.
[13] DIBA A, SHARMA V, VAN GOOL L, et al. DynamoNet: dynamic action and motion network [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 6191-6200.
[14] ZHOU S, ZHANG J, PAN J, et al. Spatio-temporal filter adaptive network for video deblurring [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 2482-2491.
[15] HE J, DENG Z, QIAO Y. Dynamic multi-scale filters for semantic segmentation [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 3561-3571.
[16] PANG Y, ZHANG L, ZHAO X, et al. Hierarchical dynamic filtering network for RGB-D salient object detection [C]// Proceedings of the 16th European Conference on Computer Vision. Cham: Springer, 2020: 235-252.
[17] YU S, XIAO J, ZHANG B, et al. Democracy does matter: comprehensive feature mining for co-salient object detection [C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 969-978.
[18] ZHANG M, LIU J, WANG Y, et al. Dynamic context-sensitive filtering network for video salient object detection [C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 1533-1543.
[19] CHEN Y, ZOU W, TANG Y, et al. SCOM: spatiotemporal constrained optimization for salient object detection [J]. IEEE Transactions on Image Processing, 2018, 27(7): 3345-3357.
[20] WANG W, SHEN J, SHAO L. Video salient object detection via fully convolutional networks [J]. IEEE Transactions on Image Processing, 2017, 27(1): 38-49.
[21] SONG H, WANG W, ZHAO S, et al. Pyramid dilated deeper ConvLSTM for video salient object detection [C]// Proceedings of the 15th European Conference on Computer Vision. Cham: Springer, 2018: 715-731.
[22] FAN D-P, WANG W, CHENG M-M, et al. Shifting more attention to video salient object detection [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 8546-8556.
[23] CHEN C, WANG G, PENG C, et al. Exploring rich and efficient spatial temporal interactions for real-time video salient object detection [J]. IEEE Transactions on Image Processing, 2021, 30: 3995-4007.
[24] JI Y, ZHANG H, JIE Z, et al. CASNet: a cross-attention Siamese network for video salient object detection [J]. IEEE Transactions on Neural Networks and Learning Systems, 2020, 32(6): 2676-2690.
[25] LI H, CHEN G, LI G, et al. Motion guided attention for video salient object detection [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 7273-7282.
[26] JI G-P, FU K, WU Z, et al. Full-duplex strategy for video object segmentation [C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 4902-4913.

Network	Backbone	Optical flow	CRF	DAVIS F-measure/%	DAVIS S-measure/%	DAVIS MAE	DAVSOD F-measure/%	DAVSOD S-measure/%	DAVSOD MAE	VOS F-measure/%	VOS S-measure/%	VOS MAE
SCOM [19]	—	—	—	78.30	83.20	0.048	46.40	59.90	0.220	69.00	71.20	0.162
DLVS [20]	—	—	—	70.80	79.40	0.061	52.10	65.70	0.129	67.50	76.00	0.099
PDB [21]	ResNet50	√	√	85.50	88.20	0.028	57.20	69.80	0.116	74.20	81.80	0.078
SSAV [22]	ResNet50	√	√	86.10	89.30	0.028	60.30	72.40	0.090	74.20	81.90	0.073
STFA [23]	ResNet50	√	√	86.50	89.20	0.023	65.10	74.60	0.086	79.10	85.00	0.058
CAS [24]	ResNet50	√	√	86.00	87.30	0.032	60.80	69.90	0.086	77.40	80.80	0.051
RCRNet [7]	ResNet50	√	√	84.80	88.60	0.027	65.30	74.10	0.087	83.30	87.30	0.051
MGA [25]	ResNet50	√	√	89.20	91.20	0.022	65.60	75.10	0.081	73.50	79.20	0.075
FSNet [26]	ResNet50	√	√	90.70	92.00	0.020	68.50	77.30	0.072	—	—	—
DCFNet [18]	ResNet101	×	×	91.00	91.40	0.016	66.00	74.10	0.074	79.10	84.60	0.060
PCSA [8]	MobileNetV3	×	×	88.00	90.20	0.022	65.50	74.10	0.086	74.70	82.70	0.065
Proposed network	MobileNetV3	×	×	89.70	90.90	0.018	65.80	74.00	0.080	78.20	83.80	0.062
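For readers unfamiliar with the columns in this comparison, the sketch below shows how MAE and a thresholded F-measure are commonly computed for one predicted saliency map. The fixed threshold of 0.5 and β² = 0.3 are conventions often used in salient object detection and are assumptions here; this page does not state which F-measure variant the authors report. S-measure is omitted because its definition is considerably longer.

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map in [0, 1] and a binary mask."""
    pairs = [(p, g) for prow, grow in zip(pred, gt) for p, g in zip(prow, grow)]
    return sum(abs(p - g) for p, g in pairs) / len(pairs)


def f_measure(pred, gt, threshold=0.5, beta2=0.3):
    """F-measure of the binarized prediction (threshold and beta2 are assumed)."""
    tp = fp = fn = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            predicted_fg = p >= threshold
            if predicted_fg and g == 1:
                tp += 1
            elif predicted_fg:
                fp += 1
            elif g == 1:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


# Hypothetical 2x2 prediction and ground-truth mask.
pred = [[0.9, 0.2], [0.6, 0.1]]
gt = [[1, 0], [0, 0]]
print(mae(pred, gt))
print(f_measure(pred, gt))
```

Lower MAE is better, while higher F-measure and S-measure are better, which is how the rows above should be read.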

Network	Parameters/10⁶	Inference time/s
SSAV	81.2	0.450
AGNN	82.3	0.550
MGA	254.0	0.290
AnDiff [9]	79.3	0.360
DCFNet	274.0	0.036
Proposed network	15.6	0.026
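The per-frame inference times in this table relate to the frame rates quoted in the abstract by a simple reciprocal. The short check below works through that arithmetic; the helper names are illustrative only.

```python
def frames_per_second(seconds_per_frame):
    """Convert a per-frame inference time into a frame rate."""
    return 1.0 / seconds_per_frame


def relative_gain_percent(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


dcfnet_fps = frames_per_second(0.036)    # about 27.8, reported as 28 frame/s
proposed_fps = frames_per_second(0.026)  # about 38.5, reported as 38 frame/s

# Gain computed from the rounded frame rates quoted in the abstract.
print(round(relative_gain_percent(28, 38), 1))
# Gain computed from the unrounded inference times in the table.
print(round(relative_gain_percent(dcfnet_fps, proposed_fps), 1))
# Ratio of parameter counts, DCFNet to the proposed network.
print(round(274.0 / 15.6, 1))
```

The 35.7% figure in the abstract follows from the rounded rates of 28 and 38 frame/s; using the unrounded times gives a slightly larger gain.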

Hyperparameter K	F-measure/%	S-measure/%	MAE
0	86.00	88.00	0.027
10	88.89	90.00	0.020
20	89.51	90.43	0.019
50	89.70	90.91	0.018
100	89.64	90.94	0.018
300	89.71	90.90	0.018

Point selection method	F-measure/%	S-measure/%	MAE
Threshold + equal-probability sampling	89.06	90.22	0.019
Adaptive max pooling	88.89	90.00	0.020
Softmax + confidence	89.70	90.91	0.018
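One plausible reading of the softmax-plus-confidence strategy is to normalize a coarse saliency logit map with a softmax over all positions and keep the K most confident ones. The sketch below illustrates that reading only; the function names and the exact selection rule are assumptions, since this page does not describe how the authors implement the step.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_k(logit_map, k):
    """Return the k most confident positions as ((row, col), confidence) pairs."""
    width = len(logit_map[0])
    flat = [value for row in logit_map for value in row]
    confidence = softmax(flat)
    order = sorted(range(len(flat)), key=lambda i: confidence[i], reverse=True)[:k]
    return [((i // width, i % width), confidence[i]) for i in order]


# Hypothetical 2x2 coarse saliency logits.
logit_map = [[0.1, 2.0], [3.0, -1.0]]
for position, conf in select_top_k(logit_map, 2):
    print(position, round(conf, 3))
```

Unlike threshold-based sampling, a top-K rule of this kind always returns the same number of points per frame, which keeps the size of the downstream similarity matrix fixed.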

Fusion method	F-measure/%	S-measure/%	MAE
Direct addition	89.59	90.84	0.019
Concatenation	89.63	90.84	0.018
Dynamic information fusion	89.70	90.91	0.018