Video Mask Transfiner for High-Quality Video Instance Segmentation

Lei Ke, Henghui Ding, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu
ECCV 2022


Abstract

While Video Instance Segmentation (VIS) has seen rapid progress, current approaches struggle to predict high-quality masks with accurate boundary details. Moreover, the predicted segmentations often fluctuate over time, suggesting that temporal consistency cues are neglected or not fully utilized. In this paper, we set out to tackle these issues, with the aim of achieving highly detailed and more temporally stable mask predictions for VIS.

We first propose the Video Mask Transfiner (VMT) method, capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure. Our VMT detects and groups sparse error-prone spatio-temporal regions of each tracklet in the video segment, which are then refined using both local and instance-level cues. Second, we identify that the coarse boundary annotations of the popular YouTube-VIS dataset constitute a major limiting factor.

Based on our VMT architecture, we therefore design an automated annotation refinement approach by iterative training and self-correction. To benchmark high-quality mask predictions for VIS, we introduce the HQ-YTVIS dataset, consisting of a manually re-annotated test set and our automatically refined training data. We compare VMT with the most recent state-of-the-art methods on HQ-YTVIS, as well as the OVIS and BDD100K MOTS benchmarks. Experimental results clearly demonstrate the effectiveness of our method in segmenting complex and dynamic objects by capturing precise details.

Method

Method overview (poster figure).
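As a rough illustration of the core idea, the sketch below flags pixels whose values change after a downsample-upsample round trip; such detail-sensitive (mostly boundary) pixels are natural candidates for the error-prone regions that VMT refines with local and instance-level cues. This is a conceptual sketch, not the official VMT implementation; the function name, scale factor, and threshold are illustrative.

```python
import torch
import torch.nn.functional as F

def incoherent_regions(mask: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """mask: (1, 1, H, W) binary float tensor.
    Returns a boolean map of pixels likely to be lost at coarse resolution."""
    h, w = mask.shape[-2:]
    # Downsample to a coarse grid and restore to full resolution.
    coarse = F.interpolate(mask, scale_factor=1.0 / scale, mode="bilinear",
                           align_corners=False)
    restored = F.interpolate(coarse, size=(h, w), mode="bilinear",
                             align_corners=False)
    # Pixels where the round trip disagrees with the original mask are
    # detail-sensitive (mostly boundary) candidates for refinement.
    return ((restored > 0.5).float() - mask).abs() > 0.5
```

In VMT, such regions are additionally grouped across the frames of a tracklet into sparse spatio-temporal regions before being refined by the video transformer.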

Visual Results Comparison

HQ-YTVIS vs. YTVIS

Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction. To benchmark high-quality mask predictions for VIS, we introduce the HQ-YTVIS dataset together with the Tube-Boundary AP metric at ECCV 2022. Built on YTVIS, HQ-YTVIS consists of a manually re-annotated test set and automatically refined training data, providing training, validation and testing support to facilitate future development of VIS methods aiming at higher mask quality. For easy comparison, we highlight the instance mask of only one object per video. For more details on the HQ-YTVIS benchmark, please refer to the dataset page.
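To make the tube-level boundary evaluation concrete, the sketch below shows one plausible way to score a predicted tracklet against a ground-truth tracklet: compare only thin bands around the mask contours (in the spirit of Boundary IoU) and accumulate intersections and unions over all frames of the tube. This is a simplified, hedged illustration rather than the official Tube-Boundary AP definition; see the paper and code for the exact metric, and note that `dilation_ratio` here is an illustrative parameter.

```python
import cv2
import numpy as np

def boundary_band(mask: np.ndarray, dilation_ratio: float = 0.02) -> np.ndarray:
    """Return the band of mask pixels lying near the mask contour."""
    h, w = mask.shape
    d = max(1, int(round(dilation_ratio * np.sqrt(h ** 2 + w ** 2))))
    eroded = cv2.erode(mask.astype(np.uint8), np.ones((3, 3), np.uint8), iterations=d)
    return mask.astype(bool) & ~eroded.astype(bool)

def tube_boundary_iou(gt_tube, pred_tube, dilation_ratio: float = 0.02) -> float:
    """gt_tube, pred_tube: lists of per-frame binary (H, W) masks of one tracklet."""
    inter = union = 0
    for g, p in zip(gt_tube, pred_tube):
        gb = boundary_band(g, dilation_ratio)
        pb = boundary_band(p, dilation_ratio)
        # Accumulate boundary-band overlap over the whole tube.
        inter += np.logical_and(gb, pb).sum()
        union += np.logical_or(gb, pb).sum()
    return inter / union if union > 0 else 1.0
```

A tube-level IoU of this kind can then be plugged into standard AP matching across tracklets, which is the role Tube-Boundary AP plays in our benchmark.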

Paper

Please refer to our paper for more details on VMT and the HQ-YTVIS dataset.

Code

The code and models of the VMT method, together with our Tube-Boundary AP evaluation, are available at:

github.com/SysCV/vmt
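For quickly browsing the HQ-YTVIS labels, the snippet below is a minimal sketch assuming the release follows the standard YouTube-VIS JSON layout (videos / annotations / categories with per-frame RLE segmentations); the file name is a placeholder, and the format details should be checked against the repository above.

```python
import json
from pycocotools import mask as mask_utils

# Placeholder path; substitute the actual annotation file from the release.
with open("hq_ytvis_train.json") as f:
    data = json.load(f)

videos = {v["id"]: v for v in data["videos"]}
for ann in data["annotations"][:1]:          # inspect a single tracklet
    video = videos[ann["video_id"]]
    for frame_idx, rle in enumerate(ann["segmentations"]):
        if rle is None:                      # object not visible in this frame
            continue
        m = mask_utils.decode(rle)           # (H, W) binary mask (compressed RLE assumed)
        print(video["file_names"][frame_idx], int(m.sum()))
```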

License

HQ-YTVIS labels are provided under CC BY-SA 4.0 License. Please refer to YouTube-VOS terms for use of the videos.

Citation

@inproceedings{vmt,
    title = {Video Mask Transfiner for High-Quality Video Instance Segmentation},
    author = {Ke, Lei and Ding, Henghui and Danelljan, Martin and Tai, Yu-Wing and Tang, Chi-Keung and Yu, Fisher},
    booktitle = {European Conference on Computer Vision (ECCV)},
    year = {2022}
} 

Related


Video Mask Transfiner for High-Quality Video Instance Segmentation

ECCV 2022 We introduce the HQ-YTVIS dataset as well as Tube-Boundary AP, which provide training, validation and testing support to facilitate future development of VIS methods aiming at higher mask quality.


SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation

CVPR 2022 We introduce the largest synthetic dataset for autonomous driving to study continuous domain adaptation and multi-task perception.


Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation

NeurIPS 2021 Spotlight We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation.


BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning

CVPR 2020 Oral The largest driving video dataset for heterogeneous multitask learning.


Mask-Free Video Instance Segmentation

CVPR 2023 We remove the need for video and image mask annotations when training highly accurate VIS models.


Dense Prediction with Attentive Feature Aggregation

WACV 2023 We propose Attentive Feature Aggregation (AFA) to exploit both spatial and channel information for semantic segmentation and boundary detection.


CC-3DT: Panoramic 3D Object Tracking via Cross-Camera Fusion

CoRL 2022 We propose a method for panoramic 3D object tracking, called CC-3DT, that associates and models object trajectories both temporally and across views.


Tracking Every Thing in the Wild

ECCV 2022 We introduce a new metric, Track Every Thing Accuracy (TETA), and a Track Every Thing tracker (TETer), which performs association using Class Exemplar Matching (CEM).


TACS: Taxonomy Adaptive Cross-Domain Semantic Segmentation

ECCV 2022 We introduce the more general taxonomy adaptive cross-domain semantic segmentation (TACS) problem, allowing for inconsistent taxonomies between the two domains.


Transforming Model Prediction for Tracking

CVPR 2022 We propose a tracker architecture employing a Transformer-based model prediction module.