Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (12): 3884-3890. DOI: 10.11772/j.issn.1001-9081.2021091636
• Multimedia computing and computer simulation •
Xiao LYU1, Huihui SONG2, Jiaqing FAN1
Received: 2021-09-17
Revised: 2022-01-11
Accepted: 2022-01-19
Online: 2022-12-21
Published: 2022-12-10
Contact: Huihui SONG
About author: LYU Xiao, born in 1996 in Taizhou, Jiangsu, M. S. candidate. His research interests include video object segmentation and video object tracking.
Xiao LYU, Huihui SONG, Jiaqing FAN. Semi-supervised video object segmentation via deep and shallow representations fusion[J]. Journal of Computer Applications, 2022, 42(12): 3884-3890.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021091636
Algorithm | J/% | F/% | J&F/% | Frame rate/(frame·s⁻¹) | Frame rate (2080Ti)/(frame·s⁻¹) |
---|---|---|---|---|---|
Ref. [ | 81.4 | 82.1 | 81.8 | 14.30 | ― |
Ref. [ | 88.7 | 89.9 | 89.3 | 6.25 | ― |
Ref. [ | 81.1 | 82.2 | 81.7 | 2.20 | ― |
Ref. [ | 74.0 | 72.9 | 73.5 | 7.14 | ― |
Ref. [ | 84.9 | 88.6 | 86.8 | 0.01 | ― |
Ref. [ | 85.6 | 87.5 | 86.6 | 0.22 | ― |
Ref. [ | 86.1 | 84.9 | 85.5 | 0.08 | ― |
Ref. [ | 82.6 | 83.6 | 83.1 | 39.00 | ― |
FRTM | 83.7 | 83.4 | 83.6 | 21.90 | 18.17 |
Proposed algorithm | 85.5 | 86.3 | 85.9 | ― | 17.76 |
Tab. 1 Evaluation results of different algorithms on DAVIS 2016 validation set
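The J and F columns are the standard DAVIS measures: region similarity J is the Jaccard index (mask IoU), F is the contour accuracy, and the reported J&F is their arithmetic mean. A minimal pure-Python sketch of J and the J&F average (representing masks as sets of pixel coordinates is our simplification, not the benchmark's implementation):

```python
def region_similarity(pred, gt):
    """J metric: Jaccard index (intersection over union) of two binary
    masks, here represented as sets of (row, col) pixel coordinates."""
    union = pred | gt
    if not union:               # both masks empty: treat as a perfect match
        return 1.0
    return len(pred & gt) / len(union)

def j_and_f(j, f):
    """The reported J&F score is the arithmetic mean of J and F."""
    return (j + f) / 2

# Two masks sharing one of three distinct foreground pixels -> J = 1/3
pred = {(0, 0), (0, 1)}
gt = {(0, 0), (1, 1)}
print(region_similarity(pred, gt))          # -> 0.3333...

# Proposed algorithm in Table 1: J = 85.5, F = 86.3
print(round(j_and_f(85.5, 86.3), 1))        # -> 85.9
```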
Algorithm | J/% | F/% | J&F/% | Frame rate/(frame·s⁻¹) | Frame rate (2080Ti)/(frame·s⁻¹) |
---|---|---|---|---|---|
Ref. [ | 67.2 | 72.7 | 70.0 | 14.30 | ― |
Ref. [ | 79.2 | 84.3 | 81.8 | 6.25 | ― |
Ref. [ | 69.1 | 74.0 | 71.5 | 2.20 | ― |
Ref. [ | 52.5 | 57.1 | 54.8 | 7.14 | ― |
Ref. [ | 73.9 | 81.7 | 77.8 | 0.01 | ― |
Ref. [ | 64.7 | 71.3 | 68.0 | 0.22 | ― |
Ref. [ | 64.5 | 71.2 | 67.9 | 0.08 | ― |
Ref. [ | 68.6 | 76.0 | 72.3 | 39.00 | ― |
FRTM | 73.8 | 79.6 | 76.7 | 21.90 | 18.17 |
Proposed algorithm | 75.0 | 80.5 | 77.8 | ― | 17.76 |
Tab. 2 Evaluation results of different algorithms on DAVIS 2017 validation set
Algorithm | J (seen) | J (unseen) | F (seen) | F (unseen) | Overall metric G |
---|---|---|---|---|---|
Ref. [ | 67.8 | 60.8 | 69.5 | 66.2 | 66.1 |
Ref. [ | 60.1 | 46.1 | 62.7 | 51.4 | 55.2 |
Ref. [ | 71.4 | 56.5 | ― | ― | 66.9 |
Ref. [ | ― | ― | ― | ― | 68.2 |
Proposed algorithm | 68.0 | 60.7 | 71.3 | 68.4 | 67.1 |
Tab. 3 Evaluation results of different algorithms on YouTube-VOS validation set
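On YouTube-VOS, the overall score G in the last column is conventionally the average of the four per-split scores, i.e. J and F over both seen and unseen object categories; the values above are consistent with that convention. A quick check, assuming this definition of G:

```python
def overall_g(j_seen, j_unseen, f_seen, f_unseen):
    """YouTube-VOS overall score: mean of J and F across seen and
    unseen object categories (assumed convention for the G column)."""
    return (j_seen + j_unseen + f_seen + f_unseen) / 4

# Proposed algorithm in Table 3: (68.0 + 60.7 + 71.3 + 68.4) / 4
print(round(overall_g(68.0, 60.7, 71.3, 68.4), 1))  # -> 67.1
```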
Model | J&F | Model | J&F |
---|---|---|---|
Base | 81.4 | Base+Fuse | 85.2 |
Base+EHOA | 84.6 | Base+EHOA+Fuse | 85.9 |
Tab. 4 Ablation experimental results
Model | λ1 | λ2 | λ3 | J&F/% |
---|---|---|---|---|
HOA | ― | ― | ― | 85.0 |
EHOA | 1.0 | 0.0 | 0.0 | 84.1 |
 | 0.0 | 1.0 | 0.0 | 84.6 |
 | 0.0 | 0.0 | 1.0 | 84.8 |
 | 0.1 | 0.2 | 0.7 | 85.1 |
 | 0.2 | 0.3 | 0.5 | 85.9 |
 | 0.3 | 0.3 | 0.4 | 85.3 |
 | 0.6 | 0.3 | 0.1 | 85.0 |
Tab. 5 Comparison of experimental results of EHOA and HOA models
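Table 5 sweeps the weights λ1, λ2, λ3 that EHOA uses to mix its three attention-order terms, with the best J&F of 85.9% at (0.2, 0.3, 0.5). As an illustration only (the function and feature shapes below are hypothetical sketches, not the paper's implementation), such a weighted fusion of per-order outputs amounts to a convex combination:

```python
def fuse_orders(features, weights):
    """Weighted sum of per-order attention outputs.
    features: list of equal-length vectors, one per attention order;
    weights:  the lambda coefficients, expected to sum to 1."""
    assert len(features) == len(weights) and len(features) > 0
    fused = [0.0] * len(features[0])
    for w, feat in zip(weights, features):
        for i, v in enumerate(feat):
            fused[i] += w * v
    return fused

# Best setting from Table 5: (lambda1, lambda2, lambda3) = (0.2, 0.3, 0.5)
print(fuse_orders([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.2, 0.3, 0.5]))
# -> [0.7, 0.8]
```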
Layer features | J&F/% | Layer features | J&F/% |
---|---|---|---|
Layer2 | 85.9 | Layer4 | 85.0 |
Layer3 | 85.4 | Layer5 | 84.8 |
Tab. 6 Comparison of experimental results with features of different layers
[1] GUO J M, LI Z W, CHEONG L F, et al. Video co-segmentation for meaningful action extraction[C]// Proceedings of the 2013 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2013: 2232-2239. DOI: 10.1109/iccv.2013.278.
[2] YANG T M, CHEN Z, YUE W J. Spatio-temporal two-stream human action recognition model based on video deep learning[J]. Journal of Computer Applications, 2018, 38(3): 895-899, 915. DOI: 10.11772/j.issn.1001-9081.2017071740.
[3] HU X M, TONG X C, GUO L, et al. End-to-end autonomous driving model based on deep visual attention neural network[J]. Journal of Computer Applications, 2020, 40(7): 1926-1931. DOI: 10.11772/j.issn.1001-9081.2019112054.
[4] SALEH K, HOSSNY M, NAHAVANDI S. Kangaroo vehicle collision detection using deep semantic segmentation convolutional neural network[C]// Proceedings of the 2016 International Conference on Digital Image Computing: Techniques and Applications. Piscataway: IEEE, 2016: 1-7. DOI: 10.1109/dicta.2016.7797057.
[5] OH S W, LEE J Y, SUNKAVALLI K, et al. Fast video object segmentation by reference-guided mask propagation[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7376-7385. DOI: 10.1109/cvpr.2018.00770.
[6] CAELLES S, MANINIS K K, PONT-TUSET J, et al. One-shot video object segmentation[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 5320-5329. DOI: 10.1109/cvpr.2017.565.
[7] MANINIS K K, CAELLES S, CHEN Y H, et al. Video object segmentation without temporal information[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(6): 1515-1530. DOI: 10.1109/tpami.2018.2838670.
[8] PERAZZI F, KHOREVA A, BENENSON R, et al. Learning video object segmentation from static images[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 3491-3500. DOI: 10.1109/cvpr.2017.372.
[9] XU N, YANG L J, FAN Y C, et al. YouTube-VOS: sequence-to-sequence video object segmentation[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11209. Cham: Springer, 2018: 603-619.
[10] VOIGTLAENDER P, LEIBE B. Online adaptation of convolutional neural networks for video object segmentation[C]// Proceedings of the 2017 British Machine Vision Conference. Durham: BMVA Press, 2017: No.116. DOI: 10.5244/c.31.116.
[11] OH S W, LEE J Y, XU N, et al. Video object segmentation using space-time memory networks[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 9225-9234. DOI: 10.1109/iccv.2019.00932.
[12] VOIGTLAENDER P, CHAI Y N, SCHROFF F, et al. FEELVOS: fast end-to-end embedding learning for video object segmentation[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 9473-9482. DOI: 10.1109/cvpr.2019.00971.
[13] HU Y T, HUANG J B, SCHWING A G. VideoMatch: matching based video object segmentation[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11212. Cham: Springer, 2018: 56-73.
[14] WANG N, SONG H H, ZHANG K H. Accurate object tracking algorithm based on distance weighting overlap prediction and ellipse fitting optimization[J]. Journal of Computer Applications, 2021, 41(4): 1100-1105. DOI: 10.11772/j.issn.1001-9081.2020060869.
[15] WANG Q, ZHANG L, BERTINETTO L, et al. Fast online object tracking and segmentation: a unifying approach[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 1328-1338. DOI: 10.1109/cvpr.2019.00142.
[16] LI B, YAN J J, WU W, et al. High performance visual tracking with Siamese region proposal network[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 8971-8980. DOI: 10.1109/cvpr.2018.00935.
[17] PERAZZI F, PONT-TUSET J, McWILLIAMS B, et al. A benchmark dataset and evaluation methodology for video object segmentation[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 724-732. DOI: 10.1109/cvpr.2016.85.
[18] PONT-TUSET J, PERAZZI F, CAELLES S, et al. The 2017 DAVIS challenge on video object segmentation[EB/OL]. (2018-03-01) [2021-04-03].
[19] CHEN X, LI Z X, YUAN Y, et al. State-aware tracker for real-time video object segmentation[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 9381-9390. DOI: 10.1109/cvpr42600.2020.00940.
[20] ROBINSON A, JÄREMO LAWIN F, DANELLJAN M, et al. Learning fast and robust target models for video object segmentation[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 7404-7413. DOI: 10.1109/cvpr42600.2020.00743.
[21] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. DOI: 10.1109/cvpr.2016.90.
[22] DANELLJAN M, BHAT G, KHAN F S, et al. ATOM: accurate tracking by overlap maximization[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 4655-4664. DOI: 10.1109/cvpr.2019.00479.
[23] CHEN B H, DENG W H, HU J N. Mixed high-order attention network for person re-identification[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 371-381. DOI: 10.1109/iccv.2019.00046.
[24] LI W, ZHU X T, GONG S G. Person re-identification by deep joint learning of multi-loss classification[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2017: 2194-2200. DOI: 10.24963/ijcai.2017/305.
[25] LIU J X, NI B B, YAN Y C, et al. Pose transferrable person re-identification[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 4099-4108. DOI: 10.1109/cvpr.2018.00431.
[26] ZHONG Z, ZHENG L, CAO D L, et al. Re-ranking person re-identification with k-reciprocal encoding[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 3652-3661. DOI: 10.1109/cvpr.2017.389.
[27] XIANG X Y, TIAN Y P, ZHANG Y L, et al. Zooming Slow-Mo: fast and accurate one-stage space-time video super-resolution[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 3367-3376. DOI: 10.1109/cvpr42600.2020.00343.
[28] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010.
[29] XU N, YANG L J, FAN Y C, et al. YouTube-VOS: a large-scale video object segmentation benchmark[EB/OL]. (2018-09-06) [2021-08-22]. DOI: 10.1007/978-3-030-01228-1_36.
[30] KINGMA D P, BA J L. Adam: a method for stochastic optimization[EB/OL]. (2017-01-30) [2021-08-22].
[31] JOHNANDER J, DANELLJAN M, BRISSMAN E, et al. A generative appearance model for end-to-end video object segmentation[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 8945-8954. DOI: 10.1109/cvpr.2019.00916.
[32] YANG L J, WANG Y R, XIONG X H, et al. Efficient video object segmentation via network modulation[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6499-6507. DOI: 10.1109/cvpr.2018.00680.
[33] LUITEN J, VOIGTLAENDER P, LEIBE B. PReMVOS: proposal-generation, refinement and merging for video object segmentation[C]// Proceedings of the 2018 Asian Conference on Computer Vision, LNCS 11364. Cham: Springer, 2019: 565-580.