Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation

Lei Ke, Xia Li, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu
NeurIPS 2021 Spotlight

Abstract

Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes. Most approaches only exploit the temporal dimension to address the association problem, while relying on single-frame predictions for the segmentation mask itself. We propose the Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation. PCAN first distills a space-time memory into a set of prototypes and then employs cross-attention to retrieve rich information from past frames. To segment each object, PCAN adopts a prototypical appearance module to learn a set of contrastive foreground and background prototypes, which are then propagated over time. Extensive experiments demonstrate that PCAN outperforms current video instance tracking and segmentation competition winners on both the YouTube-VIS and BDD100K datasets, and that it is effective with both one-stage and two-stage segmentation frameworks.
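For a concrete picture of the two core operations described above, here is a minimal PyTorch sketch. The function names (distill_prototypes, prototypical_cross_attention) and the EM-style soft clustering are illustrative assumptions, not the repository's actual API; see github.com/SysCV/pcan for the official implementation.

import torch
import torch.nn.functional as F

def distill_prototypes(memory_feats, num_prototypes=64, em_iters=3):
    # memory_feats: (N, C) features flattened from a space-time memory.
    # Soft k-means / EM-style clustering condenses N memory entries into
    # K prototypes; this is one common way to realize prototype
    # distillation and may differ from PCAN's exact procedure.
    N, C = memory_feats.shape
    idx = torch.randperm(N)[:num_prototypes]
    prototypes = F.normalize(memory_feats[idx].clone(), dim=1)  # (K, C)
    for _ in range(em_iters):
        # E-step: soft-assign each memory feature to the prototypes.
        assign = (memory_feats @ prototypes.t()).softmax(dim=1)  # (N, K)
        # M-step: prototypes become assignment-weighted means of the memory.
        prototypes = assign.t() @ memory_feats                   # (K, C)
        prototypes = prototypes / assign.sum(dim=0).unsqueeze(1).clamp(min=1e-6)
        prototypes = F.normalize(prototypes, dim=1)
    return prototypes

def prototypical_cross_attention(query_feats, prototypes):
    # query_feats: (HW, C) flattened current-frame features.
    # Attending to K prototypes instead of the full memory keeps the
    # attention cost linear in K rather than in the memory length.
    scale = query_feats.shape[-1] ** -0.5
    attn = (query_feats @ prototypes.t() * scale).softmax(dim=1)  # (HW, K)
    return attn @ prototypes                                      # (HW, C)

# Example: condense two past frames of 32x32 features into 64 prototypes,
# then let the current frame read temporal context from them.
memory = torch.randn(2 * 32 * 32, 256)
query = torch.randn(32 * 32, 256)
context = prototypical_cross_attention(query, distill_prototypes(memory))

In the paper, the same mechanism is additionally instantiated per tracked object, with separate contrastive foreground and background prototypes that are propagated and updated over time to produce each segmentation mask.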

Video

Poster

BDD100K Prediction Examples

These are examples of running PCAN on BDD100K; PCAN builds on QDTrack for bounding box tracking.

Quantitative Results

Paper

paper

Code

github.com/SysCV/pcan

Citation

@inproceedings{pcan,
    author    = {Ke, Lei and Li, Xia and Danelljan, Martin and Tai, Yu-Wing and Tang, Chi-Keung and Yu, Fisher},
    booktitle = {Advances in Neural Information Processing Systems},
    title     = {Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation},
    year      = {2021}
}

Related


BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning

CVPR 2020 Oral The largest driving video dataset for heterogeneous multitask learning.


Dense Prediction with Attentive Feature Aggregation

WACV 2023 We propose Attentive Feature Aggregation (AFA) to exploit both spatial and channel information for semantic segmentation and boundary detection.


Video Mask Transfiner for High-Quality Video Instance Segmentation

ECCV 2022 We introduce the HQ-YTVIS dataset, as well as the Tube-Boundary AP metric, which provide training, validation, and testing support to facilitate future development of VIS methods aiming at higher mask quality.


Video Mask Transfiner for High-Quality Video Instance Segmentation

ECCV 2022 We propose Video Mask Transfiner (VMT) method, capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.


Quasi-Dense Similarity Learning for Multiple Object Tracking

CVPR 2021 Oral We propose a simple yet effective quasi-dense similarity learning method for multiple object tracking.


Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving

ICCV 2023 VTD is a promising new direction for exploring the unification of perception tasks in autonomous driving.


OVTrack: Open-Vocabulary Multiple Object Tracking

CVPR 2023 We introduce OVTrack, the first open-vocabulary multiple object tracker, trained from only static images, along with an evaluation benchmark.


Mask-Free Video Instance Segmentation

CVPR 2023 We remove the need for video and image mask annotations when training highly accurate VIS models.


CC-3DT: Panoramic 3D Object Tracking via Cross-Camera Fusion

CoRL 2022 We propose a method for panoramic 3D object tracking, called CC-3DT, that associates and models object trajectories both temporally and across views.


Tracking Every Thing in the Wild

ECCV 2022 We introduce a new metric, Track Every Thing Accuracy (TETA), and a Track Every Thing tracker (TETer), which performs association using Class Exemplar Matching (CEM).