Video Mask Transfiner for High-Quality Video Instance Segmentation

Lei Ke, Henghui Ding, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu
ECCV 2022


Why use HQ-YTVIS?

While Video Instance Segmentation (VIS) has seen rapid progress, current approaches struggle to predict high-quality masks with accurate boundary details. To tackle this issue, we identify that the coarse boundary annotations of the popular YouTube-VIS dataset constitute a major limiting factor.

Based on our VMT architecture, we therefore design an automated annotation refinement approach by iterative training and self-correction. To benchmark high-quality mask predictions for VIS, we introduce the HQ-YTVIS dataset together with the Tube-Boundary AP metric at ECCV 2022. HQ-YTVIS consists of a manually re-annotated test set and our automatically refined training data, providing training, validation, and testing support to facilitate future development of VIS methods aiming at higher mask quality.

Annotation Comparison

For easy comparison, we highlight the instance mask of only one object per video.

Dataset Description

For HQ-YTVIS, we split the original YTVIS training set (2238 videos) into new train (1678 videos, 75%), val (280 videos, 12.5%), and test (280 videos, 12.5%) sets, following the ratios in YTVIS. The mask annotations of the train subset are self-corrected by VMT through iterative training, while the smaller val and test sets are carefully checked and relabeled by human annotators to ensure high mask boundary quality.
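Below is a minimal sketch (in Python) of how such a 75% / 12.5% / 12.5% video-level split can be produced, assuming video IDs are read from a YTVIS-style JSON file; the file name and random seed are illustrative only, and the released HQ-YTVIS split is fixed rather than regenerated this way.

```python
# Illustrative sketch only: the released HQ-YTVIS split is fixed, not regenerated.
import json
import random

with open("ytvis_train.json") as f:        # hypothetical annotation file
    video_ids = [v["id"] for v in json.load(f)["videos"]]

random.seed(0)                             # arbitrary seed for this sketch
random.shuffle(video_ids)

n = len(video_ids)                         # 2238 videos in the original YTVIS train set
n_train = round(0.75 * n)                  # 1678 videos
n_val = round(0.125 * n)                   # 280 videos
train_ids = video_ids[:n_train]
val_ids = video_ids[n_train:n_train + n_val]
test_ids = video_ids[n_train + n_val:]     # remaining 280 videos
```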

Tube-Mask AP vs. Tube-Boundary AP

We employ both the standard Tube-Mask AP used in YouTube-VIS and our proposed Tube-Boundary AP as evaluation metrics. Tube-Boundary AP not only focuses on the boundary quality of the objects, but also considers the spatio-temporal consistency between the predicted and ground-truth object masks.
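To give a rough intuition (a simplified illustration, not the exact matching and AP computation of the benchmark), the sketch below accumulates per-frame boundary overlaps over an entire mask tube, so a prediction only scores well if it follows the ground-truth contours in every frame; the boundary band definition, the dilation_ratio value, and the use of SciPy are our assumptions.

```python
# Simplified spatio-temporal boundary IoU for a mask tube (illustration only).
import numpy as np
from scipy.ndimage import binary_erosion

def boundary_region(mask, dilation_ratio=0.02):
    """Thin band of pixels inside `mask` along its contour."""
    h, w = mask.shape
    d = max(1, int(round(dilation_ratio * np.sqrt(h * h + w * w))))
    eroded = binary_erosion(mask, iterations=d)
    return mask & ~eroded

def tube_boundary_iou(pred_tube, gt_tube, dilation_ratio=0.02):
    """Sum boundary intersections and unions over all frames of a tube."""
    inter, union = 0, 0
    for pred, gt in zip(pred_tube, gt_tube):          # per-frame boolean masks
        pb = boundary_region(pred.astype(bool), dilation_ratio)
        gb = boundary_region(gt.astype(bool), dilation_ratio)
        inter += np.logical_and(pb, gb).sum()
        union += np.logical_or(pb, gb).sum()
    return inter / union if union > 0 else 0.0
```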

Dataset Download

Download Link

Paper

Please refer to our paper for more details of the HQ-YTVIS dataset.

Manual re-labeling on Val/Test Sets

Code

The code for data loading and our Tube-Boundary AP evaluation is available at:

github.com/SysCV/vmt
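For reference, a minimal sketch of reading the annotations is shown below, assuming HQ-YTVIS follows the standard YouTube-VIS JSON format with per-frame COCO-style RLE segmentations; the file name is hypothetical, and the official data loader and Tube-Boundary AP evaluation live in the repository above.

```python
# Sketch of loading YTVIS-format annotations and decoding per-frame RLE masks.
import json
from pycocotools import mask as mask_utils

with open("hq_ytvis_train.json") as f:                # hypothetical file name
    data = json.load(f)

videos = {v["id"]: v for v in data["videos"]}
for ann in data["annotations"]:
    video = videos[ann["video_id"]]
    for t, rle in enumerate(ann["segmentations"]):    # one entry per frame
        if rle is None:                               # object not visible in this frame
            continue
        binary_mask = mask_utils.decode(rle)          # (height, width) uint8 mask
```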

License

HQ-YTVIS labels are provided under the CC BY-SA 4.0 License. Please refer to the YouTube-VOS terms of use for the videos.

Citation

@inproceedings{vmt,
    title = {Video Mask Transfiner for High-Quality Video Instance Segmentation},
    author = {Ke, Lei and Ding, Henghui and Danelljan, Martin and Tai, Yu-Wing and Tang, Chi-Keung and Yu, Fisher},
    booktitle = {European Conference on Computer Vision (ECCV)},
    year = {2022}
} 

Contact

For questions and suggestions, please contact Lei Ke (keleiwhu at gmail.com).

Related


Video Mask Transfiner for High-Quality Video Instance Segmentation

ECCV 2022 We propose the Video Mask Transfiner (VMT) method, capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.


SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation

CVPR 2022 We introduce the largest synthetic dataset for autonomous driving to study continuous domain adaptation and multi-task perception.


Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation

NeurIPS 2021 Spotlight We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation.


BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning

CVPR 2020 Oral The largest driving video dataset for heterogeneous multitask learning.


Mask-Free Video Instance Segmentation

CVPR 2023 We remove the need for video and image mask annotations when training highly accurate VIS models.


Dense Prediction with Attentive Feature Aggregation

WACV 2023 We propose Attentive Feature Aggregation (AFA) to exploit both spatial and channel information for semantic segmentation and boundary detection.


CC-3DT: Panoramic 3D Object Tracking via Cross-Camera Fusion

CoRL 2022 We propose a method for panoramic 3D object tracking, called CC-3DT, that associates and models object trajectories both temporally and across views.


Tracking Every Thing in the Wild

ECCV 2022 We introduce a new metric, Track Every Thing Accuracy (TETA), and a Track Every Thing tracker (TETer), which performs association using Class Exemplar Matching (CEM).


TACS: Taxonomy Adaptive Cross-Domain Semantic Segmentation

ECCV 2022 We introduce the more general taxonomy adaptive cross-domain semantic segmentation (TACS) problem, allowing for inconsistent taxonomies between the two domains.


Transforming Model Prediction for Tracking

CVPR 2022 We propose a tracker architecture employing a Transformer-based model prediction module.