Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation

Lei Ke, Xia Li, Martin Danelljan, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu
arXiv

Abstract

Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes. Most approaches exploit the temporal dimension only to address the association problem, while relying on single-frame predictions for the segmentation mask itself. We propose the Prototypical Cross-Attention Network (PCAN), which leverages rich spatio-temporal information for online multiple object tracking and segmentation. PCAN first distills a space-time memory into a set of prototypes and then employs cross-attention to retrieve rich information from past frames. To segment each object, PCAN adopts a prototypical appearance module to learn a set of contrastive foreground and background prototypes, which are then propagated over time. Extensive experiments demonstrate that PCAN outperforms current video instance tracking and segmentation competition winners on both the YouTube-VIS and BDD100K datasets, and that it is effective with both one-stage and two-stage segmentation frameworks.
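The two steps named in the abstract, distilling a memory into prototypes and then reading from it with cross-attention, can be illustrated with a toy NumPy sketch. This is not the code from github.com/SysCV/pcan: the soft-clustering update, the function names `distill_prototypes` and `cross_attend`, the number of prototypes, and the random toy features are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distill_prototypes(memory, num_prototypes=4, iters=5, seed=0):
    """Condense memory features (N, C) into (K, C) prototypes.

    Assumption: a simple soft-clustering loop stands in for whatever
    distillation procedure the paper actually uses.
    """
    rng = np.random.default_rng(seed)
    # Initialize prototypes from randomly chosen memory features.
    idx = rng.choice(memory.shape[0], size=num_prototypes, replace=False)
    protos = memory[idx].copy()
    for _ in range(iters):
        # Soft assignment of every memory feature to every prototype.
        assign = softmax(memory @ protos.T, axis=1)            # (N, K)
        # Each prototype becomes the assignment-weighted mean of the memory.
        protos = (assign.T @ memory) / (assign.sum(axis=0)[:, None] + 1e-8)
    return protos

def cross_attend(query, protos):
    """Read from the prototypes: each query row (M, C) becomes a
    convex combination of the K prototypes."""
    scale = 1.0 / np.sqrt(query.shape[1])
    weights = softmax(scale * (query @ protos.T), axis=1)      # (M, K)
    return weights @ protos, weights

# Toy stand-ins for flattened past-frame and current-frame features.
rng = np.random.default_rng(0)
memory = rng.normal(size=(200, 16))
query = rng.normal(size=(50, 16))

protos = distill_prototypes(memory)
readout, weights = cross_attend(query, protos)
print(protos.shape, readout.shape)  # (4, 16) (50, 16)
```

The point of the sketch is the cost structure: attending from M query positions to K prototypes costs on the order of M x K, rather than M x N against the full memory of N past-frame features.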

Code

github.com/SysCV/pcan

Citation

@article{pcanmots,
      title={Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation},
      author={Ke, Lei and Li, Xia and Danelljan, Martin and Tai, Yu-Wing and Tang, Chi-Keung and Yu, Fisher},
      journal={arXiv preprint arXiv:2106.11958},
      year={2021}
}

Related


BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning

CVPR 2020 Oral The largest driving video dataset for heterogeneous multitask learning.


Quasi-Dense Similarity Learning for Multiple Object Tracking

CVPR 2021 Oral We propose a simple yet effective multi-object tracking method in this paper.


Exploring Cross-Image Pixel Contrast for Semantic Segmentation

ICCV 2021 Oral We propose a pixel-wise contrastive algorithm for semantic segmentation in the fully supervised setting.


Learning Saliency Propagation for Semi-Supervised Instance Segmentation

CVPR 2020 We propose a ShapeProp module to propagate information between object detection and segmentation supervisions for Semi-Supervised Instance Segmentation.


Joint Monocular 3D Vehicle Detection and Tracking

ICCV 2019 We propose a novel online framework for 3D vehicle detection and tracking from monocular videos.


Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation

ECCV 2018 We aim to characterize adversarial examples based on spatial context information in semantic segmentation.


Deep Layer Aggregation

CVPR 2018 Oral We augment standard architectures with deeper aggregation to better fuse information across layers.


Dilated Residual Networks

CVPR 2017 We show that dilated residual networks (DRNs) outperform their non-dilated counterparts in image classification without increasing the model’s depth or complexity.


FCNs in the Wild: Pixel-level Adversarial and Constraint-based Adaptation

arXiv 2016 We introduce the first domain adaptive semantic segmentation method, proposing an unsupervised adversarial approach to pixel prediction problems.


Multi-Scale Context Aggregation by Dilated Convolutions

ICLR 2016 We study dilated convolution in depth. It has become a fundamental network operation.