Lightweight video salient object detection network based on spatiotemporal information

Journal of Computer Applications, 2024, Vol. 44, Issue (7): 2192-2199. DOI: 10.11772/j.issn.1001-9081.2023070926

• Multimedia computing and computer simulation •

Song XU¹, Wenbo ZHANG¹, Yifan WANG²

¹ School of Information and Communication Engineering, Dalian University of Technology, Dalian Liaoning 116000, China
² School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian Liaoning 116000, China

Received: 2023-07-09 Revised: 2023-10-11 Accepted: 2023-10-13 Online: 2023-10-26 Published: 2024-07-10
Contact: Yifan WANG
About author: XU Song, born in 2000, from Suzhou, Anhui, M. S. candidate. His research interests include video salient object detection and weakly supervised salient object detection.
ZHANG Wenbo, born in 1999, from Panjin, Liaoning, M. S. candidate. His research interests include weakly supervised salient object detection and continual semantic segmentation.
First contact: WANG Yifan, born in 1990, from Dalian, Liaoning, Ph. D., lecturer, CCF member. Her research interests include image and video segmentation, weakly supervised learning and unsupervised learning.
Supported by:
Fundamental Research Funds for Central Universities (DUT22LAB124)

Abstract:

Existing Video Salient Object Detection (VSOD) networks face two problems. First, the high computational cost of capturing temporal information makes them difficult to deploy on edge devices. Second, their limited generalization ability leaves them poorly equipped for challenging scenes such as occlusion and motion blur. Therefore, a lightweight VSOD network based on dynamic filtering and contrastive learning was proposed. Firstly, foreground features were coarsely sampled from each frame of a sequence of consecutive frames and a similarity matrix was computed over them; weighting by this matrix filtered out noisy features. Secondly, the denoised foreground features were used to generate the parameters of a dynamic filter, which was convolved with the original feature maps to extract foreground objects. Finally, a contrastive learning module was designed to assist learning during training; it introduced no additional computation at inference. Extensive experiments were conducted on three datasets: DAVIS, DAVSOD and VOS. Experimental results show that, compared with DCFNet (Dynamic Context-sensitive Filtering Network for video salient object detection), the proposed network achieves comparable F-measure, S-measure and Mean Absolute Error (MAE), while the frame rate rises from 28 frame/s to 38 frame/s, an increase of 35.7%. The network has only 15.6×10⁶ parameters, which makes it better suited to deployment on edge devices in practical applications.

Key words: video salient object detection, dynamic filter, attention mechanism, contrastive learning, deep learning
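
The pipeline summarized in the abstract (similarity-weighted denoising of sampled foreground features, then a dynamic filter convolved with the feature map) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the cosine similarity, the softmax weighting, the single 1×1 kernel, and all function names are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def similarity_weights(samples: np.ndarray) -> np.ndarray:
    """Weight K sampled features (K, C) by how well each agrees with the others.

    Assumption: cosine similarity followed by a softmax; the paper may differ.
    """
    unit = samples / (np.linalg.norm(samples, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T                    # (K, K) similarity matrix
    score = sim.mean(axis=1)               # average agreement per sample
    exp = np.exp(score - score.max())      # numerically stable softmax
    return exp / exp.sum()

def generate_filter(samples: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Collapse the weighted samples into one hypothetical 1x1 kernel (C,)."""
    return (weights[:, None] * samples).sum(axis=0)

def apply_filter(feature_map: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """1x1 convolution of a (C, H, W) feature map with a (C,) kernel."""
    return np.einsum("chw,c->hw", feature_map, kernel)

# Toy data: one foreground pixel at the centre of a 3x3 background.
fg = np.array([1.0, 0.0, 0.0, 0.0])
bg = np.array([0.0, 1.0, 0.0, 0.0])
feature_map = np.tile(bg[:, None, None], (1, 3, 3))
feature_map[:, 1, 1] = fg

# Four clean foreground samples plus one noisy (background) sample.
samples = np.stack([fg, fg, fg, fg, bg])
weights = similarity_weights(samples)
response = apply_filter(feature_map, generate_filter(samples, weights))
print(np.round(weights, 3))
print(np.round(response, 3))
```

In this toy case the noisy sample receives the smallest weight, so the generated kernel responds most strongly at the foreground pixel.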

CLC Number: T391.4

Song XU, Wenbo ZHANG, Yifan WANG. Lightweight video salient object detection network based on spatiotemporal information[J]. Journal of Computer Applications, 2024, 44(7): 2192-2199.

References

[1] TAN M, PANG R, LE Q V. EfficientDet: scalable and efficient object detection [C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10778-10787.
[2] PAN Y, YAO T, LI H, et al. Video captioning with transferred semantic attributes [C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6504-6512.
[3] JI W, YU S, WU J, et al. Learning calibrated medical image segmentation via multi-rater agreement modeling [C]// Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 12336-12346.
[4] ITTI L. Automatic foveation for video compression using a neurobiological model of visual attention [J]. IEEE Transactions on Image Processing, 2004, 13(10): 1304-1318.
[5] HADIZADEH H, BAJIĆ I V. Saliency-aware video compression [J]. IEEE Transactions on Image Processing, 2014, 23(1): 19-33.
[6] WU H, LI G, LUO X. Weighted attentional blocks for probabilistic object tracking [J]. The Visual Computer, 2014, 30: 229-243.
[7] YAN P, LI G, XIE Y, et al. Semi-supervised video salient object detection using pseudo-labels [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 7283-7292.
[8] GU Y, WANG L, WANG Z, et al. Pyramid constrained self-attention network for fast video salient object detection [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 10869-10876.
[9] YANG Z, WANG Q, BERTINETTO L, et al. Anchor diffusion for unsupervised video object segmentation [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 931-940.
[10] YANG S, ZHANG L, QI J, et al. Learning motion-appearance co-attention for zero-shot video object segmentation [C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 1544-1553.
[11] WANG W, LU X, SHEN J, et al. Zero-shot video object segmentation via attentive graph neural networks [C]// Proceedings of the 2019 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2019: 9236-9245.
[12] YANG B, BENDER G, LE Q V, et al. CondConv: conditionally parameterized convolutions for efficient inference [C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 1307-1318.
[13] DIBA A, SHARMA V, VAN GOOL L, et al. DynamoNet: dynamic action and motion network [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 6191-6200.
[14] ZHOU S, ZHANG J, PAN J, et al. Spatio-temporal filter adaptive network for video deblurring [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 2482-2491.
[15] HE J, DENG Z, QIAO Y. Dynamic multi-scale filters for semantic segmentation [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 3561-3571.
[16] PANG Y, ZHANG L, ZHAO X, et al. Hierarchical dynamic filtering network for RGB-D salient object detection [C]// Proceedings of the 16th European Conference on Computer Vision. Cham: Springer, 2020: 235-252.
[17] YU S, XIAO J, ZHANG B, et al. Democracy does matter: comprehensive feature mining for co-salient object detection [C]// Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 969-978.
[18] ZHANG M, LIU J, WANG Y, et al. Dynamic context-sensitive filtering network for video salient object detection [C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 1533-1543.
[19] CHEN Y, ZOU W, TANG Y, et al. SCOM: spatiotemporal constrained optimization for salient object detection [J]. IEEE Transactions on Image Processing, 2018, 27(7): 3345-3357.
[20] WANG W, SHEN J, SHAO L. Video salient object detection via fully convolutional networks [J]. IEEE Transactions on Image Processing, 2017, 27(1): 38-49.
[21] SONG H, WANG W, ZHAO S, et al. Pyramid dilated deeper ConvLSTM for video salient object detection [C]// Proceedings of the 15th European Conference on Computer Vision. Cham: Springer, 2018: 715-731.
[22] FAN D-P, WANG W, CHENG M-M, et al. Shifting more attention to video salient object detection [C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 8546-8556.
[23] CHEN C, WANG G, PENG C, et al. Exploring rich and efficient spatial temporal interactions for real-time video salient object detection [J]. IEEE Transactions on Image Processing, 2021, 30: 3995-4007.
[24] JI Y, ZHANG H, JIE Z, et al. CASNet: a cross-attention Siamese network for video salient object detection [J]. IEEE Transactions on Neural Networks and Learning Systems, 2020, 32(6): 2676-2690.
[25] LI H, CHEN G, LI G, et al. Motion guided attention for video salient object detection [C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 7273-7282.
[26] JI G-P, FU K, WU Z, et al. Full-duplex strategy for video object segmentation [C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 4902-4913.

Network	Backbone	Optical flow	CRF	DAVIS F-measure/%	DAVIS S-measure/%	DAVIS MAE	DAVSOD F-measure/%	DAVSOD S-measure/%	DAVSOD MAE	VOS F-measure/%	VOS S-measure/%	VOS MAE
SCOM [19]	—	—	—	78.30	83.20	0.048	46.40	59.90	0.220	69.00	71.20	0.162
DLVS [20]	—	—	—	70.80	79.40	0.061	52.10	65.70	0.129	67.50	76.00	0.099
PDB [21]	ResNet50	√	√	85.50	88.20	0.028	57.20	69.80	0.116	74.20	81.80	0.078
SSAV [22]	ResNet50	√	√	86.10	89.30	0.028	60.30	72.40	0.090	74.20	81.90	0.073
STFA [23]	ResNet50	√	√	86.50	89.20	0.023	65.10	74.60	0.086	79.10	85.00	0.058
CAS [24]	ResNet50	√	√	86.00	87.30	0.032	60.80	69.90	0.086	77.40	80.80	0.051
RCRNet [7]	ResNet50	√	√	84.80	88.60	0.027	65.30	74.10	0.087	83.30	87.30	0.051
MGA [25]	ResNet50	√	√	89.20	91.20	0.022	65.60	75.10	0.081	73.50	79.20	0.075
FSNet [26]	ResNet50	√	√	90.70	92.00	0.020	68.50	77.30	0.072	—	—	—
DCFNet [18]	ResNet101	×	×	91.00	91.40	0.016	66.00	74.10	0.074	79.10	84.60	0.060
PCSA [8]	MobilenetV3	×	×	88.00	90.20	0.022	65.50	74.10	0.086	74.70	82.70	0.065
Proposed network	MobilenetV3	×	×	89.70	90.90	0.018	65.80	74.00	0.080	78.20	83.80	0.062
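
Two of the three metrics reported in this table have simple closed forms, sketched below in NumPy. The sketch assumes a single fixed binarization threshold and the β² = 0.3 weighting that is conventional in salient object detection. The page does not say which thresholding protocol the authors used, so these functions are illustrative rather than a reproduction of their evaluation code. S-measure is omitted because its definition is considerably longer.

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Absolute Error between a saliency map and its ground truth, both in [0, 1]."""
    return float(np.mean(np.abs(pred - gt)))

def f_measure(pred: np.ndarray, gt: np.ndarray,
              threshold: float = 0.5, beta_sq: float = 0.3) -> float:
    """Weighted harmonic mean of precision and recall at one threshold.

    Assumptions: fixed threshold of 0.5 and beta^2 = 0.3; multiply by 100
    to compare with the percentage columns of the table.
    """
    binary = pred >= threshold
    gt_mask = gt >= 0.5
    true_pos = np.logical_and(binary, gt_mask).sum()
    precision = true_pos / max(binary.sum(), 1)
    recall = true_pos / max(gt_mask.sum(), 1)
    denom = beta_sq * precision + recall
    if denom == 0:
        return 0.0
    return float((1 + beta_sq) * precision * recall / denom)

# Toy 2x2 example: one true salient pixel, one false positive.
pred = np.array([[0.9, 0.1], [0.8, 0.2]])
gt = np.array([[1.0, 0.0], [0.0, 0.0]])
print(mae(pred, gt), f_measure(pred, gt))
```

Lower MAE and higher F-measure are better, which is why the two kinds of column in the table move in opposite directions for stronger networks.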

Network	Parameters/10⁶	Inference time/s
SSAV	81.2	0.450
AGNN	82.3	0.550
MGA	254.0	0.290
AnDiff [9]	79.3	0.360
DCFNet	274.0	0.036
Proposed network	15.6	0.026
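
The frame rates quoted in the abstract can be recovered from the inference times in this table, assuming the times are per frame. The short Python check below does that arithmetic; the function names are illustrative only.

```python
def frames_per_second(seconds_per_frame: float) -> float:
    """Convert a per-frame inference time into a frame rate."""
    return 1.0 / seconds_per_frame

def relative_gain(old: float, new: float) -> float:
    """Percentage increase from old to new."""
    return (new - old) / old * 100.0

dcfnet_fps = round(frames_per_second(0.036))    # DCFNet row of the table
proposed_fps = round(frames_per_second(0.026))  # proposed network row
speedup = round(relative_gain(dcfnet_fps, proposed_fps), 1)
param_ratio = round(274.0 / 15.6, 1)            # DCFNet vs. proposed parameters

print(dcfnet_fps, proposed_fps, speedup, param_ratio)
```

The rounded frame rates are 28 and 38 frame/s, and the 35.7% gain is computed on those rounded values. By the table's figures, DCFNet has roughly 17.6 times as many parameters as the proposed network.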

Hyperparameter K	F-measure/%	S-measure/%	MAE
0	86.00	88.00	0.027
10	88.89	90.00	0.020
20	89.51	90.43	0.019
50	89.70	90.91	0.018
100	89.64	90.94	0.018
300	89.71	90.90	0.018

Point selection method	F-measure/%	S-measure/%	MAE
Threshold + equal-probability sampling	89.06	90.22	0.019
Adaptive max pooling	88.89	90.00	0.020
Softmax + confidence	89.70	90.91	0.018
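
The three selection strategies named in this table are not described in detail on this page. The NumPy sketch below gives one plausible reading of each, purely for illustration: the function names, the 0.5 threshold, the non-overlapping grid used to imitate adaptive max pooling, and the top-K rule for the softmax variant are all assumptions.

```python
import numpy as np

def select_threshold_uniform(saliency, k, threshold=0.5, rng=None):
    """Keep pixels above a threshold, then draw up to k of them with equal probability."""
    rng = rng if rng is not None else np.random.default_rng(0)
    candidates = np.flatnonzero(saliency.ravel() >= threshold)
    if candidates.size == 0:
        return candidates
    return rng.choice(candidates, size=min(k, candidates.size), replace=False)

def select_adaptive_max_pool(saliency, grid):
    """Split the map into grid x grid cells and keep the strongest pixel of each.

    Assumes grid <= min(height, width); cells do not overlap, which only
    approximates framework-level adaptive pooling.
    """
    h, w = saliency.shape
    picks = []
    for rows in np.array_split(np.arange(h), grid):
        for cols in np.array_split(np.arange(w), grid):
            cell = saliency[np.ix_(rows, cols)]
            i, j = np.unravel_index(np.argmax(cell), cell.shape)
            picks.append(int(rows[i] * w + cols[j]))
    return np.array(picks)

def select_softmax_topk(logits, k):
    """Turn logits into softmax confidences and keep the k most confident pixels."""
    flat = logits.ravel()
    prob = np.exp(flat - flat.max())
    prob /= prob.sum()
    return np.argsort(prob)[::-1][:k]

# Toy 4x4 map whose value grows with the flat pixel index.
saliency = np.arange(16, dtype=float).reshape(4, 4) / 15.0
print(select_threshold_uniform(saliency, k=3))
print(select_adaptive_max_pool(saliency, grid=2))
print(select_softmax_topk(saliency, k=3))
```

All three functions return flat pixel indices, so the selected features could be gathered from a (C, H·W) feature matrix in the same way regardless of strategy.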

Fusion method	F-measure/%	S-measure/%	MAE
Direct addition	89.59	90.84	0.019
Concatenation	89.63	90.84	0.018
Dynamic information fusion	89.70	90.91	0.018
