Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (12): 3884-3890. DOI: 10.11772/j.issn.1001-9081.2021091636
Xiao LYU, Huihui SONG, Jiaqing FAN
Received:
2021-09-17
Revised:
2022-01-11
Accepted:
2022-01-19
Online:
2022-12-21
Published:
2022-12-10
Contact:
Huihui SONG
About author:
LYU Xiao, born in 1996, M. S. candidate. His research interests include video object segmentation and video object tracking.
Abstract:
To address two problems in semi-supervised video object segmentation — the difficulty of balancing segmentation accuracy against segmentation speed, and the inability to effectively distinguish background objects that resemble the foreground — a semi-supervised video object segmentation algorithm based on the fusion of deep and shallow features was proposed. First, a pre-generated coarse mask was used to process the image features, yielding more robust features; then, deep semantic information was extracted by an attention model; finally, the deep semantic information was fused with shallow positional information to obtain more accurate segmentation results. Experiments on several popular datasets show that, with segmentation speed essentially unchanged, the proposed algorithm improves the Jaccard (J) metric on the DAVIS 2016 dataset by 1.8 percentage points over the video object segmentation algorithm that learns Fast and Robust Target Models (FRTM), and improves the overall metric J&F — the mean of the J and F scores — by 2.3 percentage points; on the DAVIS 2017 dataset, it improves J by 1.2 percentage points and J&F by 1.1 percentage points over FRTM. These results demonstrate that the proposed algorithm achieves higher segmentation accuracy while maintaining fast segmentation, and that it robustly distinguishes similar foreground and background objects.
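The three-stage pipeline the abstract describes — coarse-mask modulation of the features, attention over deep semantics, and fusion with shallow positional features — can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the paper's implementation: the nearest-neighbour upsampling, the channel-softmax attention, and all function names here are simplifications chosen for clarity.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour upsampling (H, W, C) -> (2H, 2W, C); stands in for
    # the bilinear interpolation a real segmentation network would use.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse(shallow, deep, coarse_mask):
    """Fuse shallow positional features with deep semantic features.

    shallow:     (2H, 2W, C) early-layer features with fine spatial detail
    deep:        (H,  W,  C) late-layer features with semantic content
    coarse_mask: (2H, 2W)    pre-generated rough foreground mask in [0, 1]
    """
    # Step 1: modulate the features with the coarse mask, suppressing
    # background responses so the remaining features are more robust.
    masked = shallow * coarse_mask[..., None]
    # Step 2: derive channel-attention weights from the deep features
    # (global average pooling + softmax) — a toy stand-in for the
    # paper's attention model.
    w = deep.mean(axis=(0, 1))                 # (C,)
    w = np.exp(w - w.max())
    w /= w.sum()                               # softmax over channels
    # Step 3: upsample the deep semantics to the shallow resolution and
    # fuse them with the masked shallow (positional) features.
    semantic = upsample2x(deep) * w
    return masked + semantic
```

Running `fuse` on dummy features shows the mask suppressing the background half of the shallow features while the upsampled, attention-weighted semantics are added everywhere.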
Xiao LYU, Huihui SONG, Jiaqing FAN. Semi-supervised video object segmentation via deep and shallow representations fusion[J]. Journal of Computer Applications, 2022, 42(12): 3884-3890.
| Algorithm | J/% | F/% | J&F/% | Frame rate/(frame·s⁻¹) | Frame rate (2080Ti)/(frame·s⁻¹) |
| --- | --- | --- | --- | --- | --- |
| Ref. [ | 81.4 | 82.1 | 81.8 | 14.30 | ― |
| Ref. [ | 88.7 | 89.9 | 89.3 | 6.25 | ― |
| Ref. [ | 81.1 | 82.2 | 81.7 | 2.20 | ― |
| Ref. [ | 74.0 | 72.9 | 73.5 | 7.14 | ― |
| Ref. [ | 84.9 | 88.6 | 86.8 | 0.01 | ― |
| Ref. [ | 85.6 | 87.5 | 86.6 | 0.22 | ― |
| Ref. [ | 86.1 | 84.9 | 85.5 | 0.08 | ― |
| Ref. [ | 82.6 | 83.6 | 83.1 | 39.00 | ― |
| FRTM | 83.7 | 83.4 | 83.6 | 21.90 | 18.17 |
| Proposed algorithm | 85.5 | 86.3 | 85.9 | ― | 17.76 |

Tab. 1 Evaluation results of different algorithms on DAVIS 2016 validation set
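The J, F, and J&F columns follow the standard DAVIS protocol: J is the region similarity (intersection-over-union of the predicted and ground-truth masks), F is the contour accuracy, and J&F is their arithmetic mean, as the abstract states. A minimal sketch of J and J&F (F, the boundary F-measure, is omitted for brevity):

```python
import numpy as np

def jaccard(pred, gt):
    # Region similarity J: intersection-over-union of two binary masks.
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return np.logical_and(pred, gt).sum() / union

def j_and_f(j, f):
    # The overall J&F score reported in the tables is the mean of J and F.
    return (j + f) / 2
```

For example, the FRTM row gives `j_and_f(83.7, 83.4) = 83.55`, which the table reports rounded to 83.6.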
| Algorithm | J/% | F/% | J&F/% | Frame rate/(frame·s⁻¹) | Frame rate (2080Ti)/(frame·s⁻¹) |
| --- | --- | --- | --- | --- | --- |
| Ref. [ | 67.2 | 72.7 | 70.0 | 14.30 | ― |
| Ref. [ | 79.2 | 84.3 | 81.8 | 6.25 | ― |
| Ref. [ | 69.1 | 74.0 | 71.5 | 2.20 | ― |
| Ref. [ | 52.5 | 57.1 | 54.8 | 7.14 | ― |
| Ref. [ | 73.9 | 81.7 | 77.8 | 0.01 | ― |
| Ref. [ | 64.7 | 71.3 | 68.0 | 0.22 | ― |
| Ref. [ | 64.5 | 71.2 | 67.9 | 0.08 | ― |
| Ref. [ | 68.6 | 76.0 | 72.3 | 39.00 | ― |
| FRTM | 73.8 | 79.6 | 76.7 | 21.90 | 18.17 |
| Proposed algorithm | 75.0 | 80.5 | 77.8 | ― | 17.76 |

Tab. 2 Evaluation results of different algorithms on DAVIS 2017 validation set
| Algorithm | J (seen) | J (unseen) | F (seen) | F (unseen) | Overall G |
| --- | --- | --- | --- | --- | --- |
| Ref. [ | 67.8 | 60.8 | 69.5 | 66.2 | 66.1 |
| Ref. [ | 60.1 | 46.1 | 62.7 | 51.4 | 55.2 |
| Ref. [ | 71.4 | 56.5 | ― | ― | 66.9 |
| Ref. [ | ― | ― | ― | ― | 68.2 |
| Proposed algorithm | 68.0 | 60.7 | 71.3 | 68.4 | 67.1 |

Tab. 3 Evaluation results of different algorithms on YouTube-VOS validation set (unit: %)
| Model | J&F | Model | J&F |
| --- | --- | --- | --- |
| Base | 81.4 | Base+Fuse | 85.2 |
| Base+EHOA | 84.6 | Base+EHOA+Fuse | 85.9 |

Tab. 4 Ablation experimental results (unit: %)
| Model | λ1 | λ2 | λ3 | J&F/% |
| --- | --- | --- | --- | --- |
| HOA | ― | ― | ― | 85.0 |
| EHOA | 1.0 | 0.0 | 0.0 | 84.1 |
|  | 0.0 | 1.0 | 0.0 | 84.6 |
|  | 0.0 | 0.0 | 1.0 | 84.8 |
|  | 0.1 | 0.2 | 0.7 | 85.1 |
|  | 0.2 | 0.3 | 0.5 | 85.9 |
|  | 0.3 | 0.3 | 0.4 | 85.3 |
|  | 0.6 | 0.3 | 0.1 | 85.0 |

Tab. 5 Comparison of experimental results of EHOA and HOA models
| Layer features | J&F/% | Layer features | J&F/% |
| --- | --- | --- | --- |
| Layer2 | 85.9 | Layer4 | 85.0 |
| Layer3 | 85.4 | Layer5 | 84.8 |

Tab. 6 Comparison of experimental results with features of different layers
[1] GUO J M, LI Z W, CHEONG L F, et al. Video co-segmentation for meaningful action extraction[C]// Proceedings of the 2013 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2013: 2232-2239. 10.1109/iccv.2013.278
[2] YANG T M, CHEN Z, YUE W J. Spatio-temporal two-stream human action recognition model based on video deep learning[J]. Journal of Computer Applications, 2018, 38(3): 895-899, 915. 10.11772/j.issn.1001-9081.2017071740
[3] HU X M, TONG X C, GUO L, et al. End-to-end autonomous driving model based on deep visual attention neural network[J]. Journal of Computer Applications, 2020, 40(7): 1926-1931. 10.11772/j.issn.1001-9081.2019112054
[4] SALEH K, HOSSNY M, NAHAVANDI S. Kangaroo vehicle collision detection using deep semantic segmentation convolutional neural network[C]// Proceedings of the 2016 International Conference on Digital Image Computing: Techniques and Applications. Piscataway: IEEE, 2016: 1-7. 10.1109/dicta.2016.7797057
[5] OH S W, LEE J Y, SUNKAVALLI K, et al. Fast video object segmentation by reference-guided mask propagation[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 7376-7385. 10.1109/cvpr.2018.00770
[6] CAELLES S, MANINIS K K, PONT-TUSET J, et al. One-shot video object segmentation[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 5320-5329. 10.1109/cvpr.2017.565
[7] MANINIS K K, CAELLES S, CHEN Y H, et al. Video object segmentation without temporal information[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(6): 1515-1530. 10.1109/tpami.2018.2838670
[8] PERAZZI F, KHOREVA A, BENENSON R, et al. Learning video object segmentation from static images[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 3491-3500. 10.1109/cvpr.2017.372
[9] XU N, YANG L J, FAN Y C, et al. YouTube-VOS: sequence-to-sequence video object segmentation[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11209. Cham: Springer, 2018: 603-619.
[10] VOIGTLAENDER P, LEIBE B. Online adaptation of convolutional neural networks for video object segmentation[C]// Proceedings of the 2017 British Machine Vision Conference. Durham: BMVA Press, 2017: No.116. 10.5244/c.31.116
[11] OH S W, LEE J Y, XU N, et al. Video object segmentation using space-time memory networks[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 9225-9234. 10.1109/iccv.2019.00932
[12] VOIGTLAENDER P, CHAI Y N, SCHROFF F, et al. FEELVOS: fast end-to-end embedding learning for video object segmentation[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 9473-9482. 10.1109/cvpr.2019.00971
[13] HU Y T, HUANG J B, SCHWING A G. VideoMatch: matching based video object segmentation[C]// Proceedings of the 2018 European Conference on Computer Vision, LNCS 11212. Cham: Springer, 2018: 56-73.
[14] WANG N, SONG H H, ZHANG K H. Accurate object tracking algorithm based on distance weighting overlap prediction and ellipse fitting optimization[J]. Journal of Computer Applications, 2021, 41(4): 1100-1105. 10.11772/j.issn.1001-9081.2020060869
[15] WANG Q, ZHANG L, BERTINETTO L, et al. Fast online object tracking and segmentation: a unifying approach[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 1328-1338. 10.1109/cvpr.2019.00142
[16] LI B, YAN J J, WU W, et al. High performance visual tracking with Siamese region proposal network[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 8971-8980. 10.1109/cvpr.2018.00935
[17] PERAZZI F, PONT-TUSET J, McWILLIAMS B, et al. A benchmark dataset and evaluation methodology for video object segmentation[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 724-732. 10.1109/cvpr.2016.85
[18] PONT-TUSET J, PERAZZI F, CAELLES S, et al. The 2017 DAVIS challenge on video object segmentation[EB/OL]. (2018-03-01) [2021-04-03]. 10.1109/cvpr.2017.565
[19] CHEN X, LI Z X, YUAN Y, et al. State-aware tracker for real-time video object segmentation[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 9381-9390. 10.1109/cvpr42600.2020.00940
[20] ROBINSON A, JÄREMO LAWIN F, DANELLJAN M, et al. Learning fast and robust target models for video object segmentation[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 7404-7413. 10.1109/cvpr42600.2020.00743
[21] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. 10.1109/cvpr.2016.90
[22] DANELLJAN M, BHAT G, KHAN F S, et al. ATOM: accurate tracking by overlap maximization[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 4655-4664. 10.1109/cvpr.2019.00479
[23] CHEN B H, DENG W H, HU J N. Mixed high-order attention network for person re-identification[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 371-381. 10.1109/iccv.2019.00046
[24] LI W, ZHU X T, GONG S G. Person re-identification by deep joint learning of multi-loss classification[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. California: ijcai.org, 2017: 2194-2200. 10.24963/ijcai.2017/305
[25] LIU J X, NI B B, YAN Y C, et al. Pose transferrable person re-identification[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 4099-4108. 10.1109/cvpr.2018.00431
[26] ZHONG Z, ZHENG L, CAO D L, et al. Re-ranking person re-identification with k-reciprocal encoding[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 3652-3661. 10.1109/cvpr.2017.389
[27] XIANG X Y, TIAN Y P, ZHANG Y L, et al. Zooming Slow-Mo: fast and accurate one-stage space-time video super-resolution[C]// Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 3367-3376. 10.1109/cvpr42600.2020.00343
[28] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010.
[29] XU N, YANG L J, FAN Y C, et al. YouTube-VOS: a large-scale video object segmentation benchmark[EB/OL]. (2018-09-06) [2021-08-22]. 10.1007/978-3-030-01228-1_36
[30] KINGMA D P, BA J L. Adam: a method for stochastic optimization[EB/OL]. (2017-01-30) [2021-08-22].
[31] JOHNANDER J, DANELLJAN M, BRISSMAN E, et al. A generative appearance model for end-to-end video object segmentation[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 8945-8954. 10.1109/cvpr.2019.00916
[32] YANG L J, WANG Y R, XIONG X H, et al. Efficient video object segmentation via network modulation[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6499-6507. 10.1109/cvpr.2018.00680
[33] LUITEN J, VOIGTLAENDER P, LEIBE B. PReMVOS: proposal-generation, refinement and merging for video object segmentation[C]// Proceedings of the 2018 Asian Conference on Computer Vision, LNCS 11364. Cham: Springer, 2019: 565-580.