Mask-Free Video Instance Segmentation

Lei Ke, Martin Danelljan, Henghui Ding, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu
CVPR 2023


Abstract

The recent advancement in Video Instance Segmentation (VIS) has largely been driven by the use of deeper and increasingly data-hungry transformer-based models. However, video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets. In this work, we aim to remove the mask-annotation requirement. We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state. We leverage the rich temporal mask consistency constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss), providing strong mask supervision without any labels. Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection. A consistency loss is then enforced on the found matches. Our mask-free objective is simple to implement, has no trainable parameters, is computationally efficient, yet outperforms baselines employing, e.g., state-of-the-art optical flow to enforce temporal mask consistency. We validate MaskFreeVIS on the YouTube-VIS 2019/2021, OVIS and BDD100K MOTS benchmarks. The results clearly demonstrate the efficacy of our method by drastically narrowing the gap between fully and weakly-supervised VIS performance.

Highlight


Video Intro

Method

TK-Loss has four steps:
1) Patch Candidate Extraction: search for patch candidates across frames within a radius R.
2) Temporal KNN-Matching: select the K high-confidence candidates by their patch affinities.
3) KNN-Consistency Loss: enforce the mask consistency objective among the found matches.
4) Cyclic Tube Connection: aggregate the loss cyclically over the temporal tube.
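
For concreteness, below is a minimal PyTorch-style sketch of these four steps. It is an illustration only: the function and parameter names (tk_loss, radius, k, patch_size, dist_thresh) are hypothetical, the brute-force window search stands in for the paper's efficient patch-matching step, and the exact affinity measure, thresholds, and loss weighting may differ from the released implementation at github.com/SysCV/maskfreevis.

import torch
import torch.nn.functional as F

def tk_loss(masks, frames, radius=5, k=5, patch_size=3, dist_thresh=0.05):
    """Sketch of the Temporal KNN-patch Loss over a clip.

    masks:  (T, H, W) predicted instance-mask probabilities in [0, 1]
    frames: (T, C, H, W) input frames, assumed normalized to [0, 1]
    All hyper-parameter names and defaults here are illustrative.
    """
    T = frames.shape[0]
    loss = masks.new_zeros(())
    # Step 4) Cyclic tube connection: pair frame t with t+1, and the
    # last frame with the first, then average the pairwise losses.
    for t in range(T):
        s = (t + 1) % T
        loss = loss + _pairwise_tk_loss(masks[t], masks[s],
                                        frames[t], frames[s],
                                        radius, k, patch_size, dist_thresh)
    return loss / T


def _pairwise_tk_loss(mask_a, mask_b, frame_a, frame_b,
                      radius, k, patch_size, dist_thresh):
    C, H, W = frame_a.shape
    pad = patch_size // 2
    # Represent every pixel by its flattened local patch: (C*P*P, H, W).
    pa = F.unfold(frame_a[None], patch_size, padding=pad).view(-1, H, W)
    pb = F.unfold(frame_b[None], patch_size, padding=pad).view(-1, H, W)

    # Step 1) Patch candidate extraction: brute-force search over all
    # displacements within radius R (the paper uses an efficient
    # patch-matching step instead; torch.roll wraps at image borders,
    # which a real implementation would mask out).
    dists, cand_probs = [], []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted_pb = torch.roll(pb, shifts=(dy, dx), dims=(1, 2))
            shifted_mb = torch.roll(mask_b, shifts=(dy, dx), dims=(0, 1))
            # Patch affinity via mean squared distance of patch vectors.
            dists.append(((pa - shifted_pb) ** 2).mean(0))
            cand_probs.append(shifted_mb)
    dists = torch.stack(dists)            # (D, H, W), D = (2R+1)^2
    cand_probs = torch.stack(cand_probs)  # (D, H, W)

    # Step 2) Temporal KNN matching: keep the K nearest candidates and
    # drop low-confidence matches whose patch distance is too large.
    knn_dist, idx = dists.topk(k, dim=0, largest=False)
    knn_probs = cand_probs.gather(0, idx)           # (K, H, W)
    valid = (knn_dist < dist_thresh).float()

    # Step 3) KNN-consistency loss: matched pixels should agree on being
    # foreground or background: -log(p_i * p_j + (1 - p_i) * (1 - p_j)).
    p_a = mask_a[None].expand_as(knn_probs)
    agree = p_a * knn_probs + (1.0 - p_a) * (1.0 - knn_probs)
    loss = -torch.log(agree.clamp(min=1e-6)) * valid
    return loss.sum() / valid.sum().clamp(min=1.0)

In training, this temporal term would be combined with the box-supervised spatial objectives. The key design choice is the one-to-many K-nearest-neighbor matching, which tolerates ambiguous or occluded correspondences better than the single match produced by optical-flow baselines.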

Video Results Comparison

Web Video Results of MaskFreeVIS

Paper

Code

The code and models of MaskFreeVIS are available at:

github.com/SysCV/maskfreevis

Citation

@inproceedings{maskfreevis,
    author    = {Ke, Lei and Danelljan, Martin and Ding, Henghui and Tai, Yu-Wing and Tang, Chi-Keung and Yu, Fisher},
    title     = {Mask-Free Video Instance Segmentation},
    booktitle = {CVPR},
    year      = {2023}
}

Related


Segment Anything Meets Point Tracking

arXiv 2023 Extending SAM to video segmentation with point-based tracking, demonstrating strong zero-shot performance across popular video segmentation benchmarks.


Dense Prediction with Attentive Feature Aggregation

WACV 2023 We propose Attentive Feature Aggregation (AFA) to exploit both spatial and channel information for semantic segmentation and boundary detection.


Video Mask Transfiner for High-Quality Video Instance Segmentation

ECCV 2022 We propose the Video Mask Transfiner (VMT) method, capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.


Video Mask Transfiner for High-Quality Video Instance Segmentation

ECCV 2022 We introduce the HQ-YTVIS dataset along with the Tube-Boundary AP metric, providing training, validation and testing support to facilitate future development of VIS methods aiming at higher mask quality.


TACS: Taxonomy Adaptive Cross-Domain Semantic Segmentation

ECCV 2022 We introduce the more general taxonomy adaptive cross-domain semantic segmentation (TACS) problem, allowing for inconsistent taxonomies between the two domains.


Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation

NeurIPS 2021 Spotlight We propose the Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation.


Exploring Cross-Image Pixel Contrast for Semantic Segmentation

ICCV 2021 Oral We propose a pixel-wise contrastive algorithm for semantic segmentation in the fully supervised setting.


BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning

CVPR 2020 Oral The largest driving video dataset for heterogeneous multitask learning.


Learning Saliency Propagation for Semi-Supervised Instance Segmentation

CVPR 2020 We propose a ShapeProp module to propagate information between object detection and segmentation supervision for semi-supervised instance segmentation.


Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation

ECCV 2018 We aim to characterize adversarial examples based on spatial context information in semantic segmentation.