Transforming Model Prediction for Tracking

Christoph Mayer, Martin Danelljan, Goutam Bhat, Matthieu Paul, Danda Pani Paudel, Fisher Yu, Luc Van Gool
CVPR 2022

Abstract

Optimization-based tracking methods have been widely successful by integrating a target model prediction module, providing effective global reasoning by minimizing an objective function. While this inductive bias integrates valuable domain knowledge, it limits the expressivity of the tracking network. In this work, we therefore propose a tracker architecture employing a Transformer-based model prediction module. Transformers capture global relations with little inductive bias, which allows the tracker to learn to predict more powerful target models. We further extend the model predictor to estimate a second set of weights that are applied for accurate bounding box regression. The resulting tracker relies on both training and test frame information in order to predict all weights transductively. We train the proposed tracker end-to-end and validate its performance by conducting comprehensive experiments on multiple tracking datasets. Our tracker sets a new state of the art on three benchmarks, achieving an AUC of 68.5% on the challenging LaSOT dataset.
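
As a rough illustration of what "applying predicted target model weights" can mean, the toy sketch below correlates a weight vector with test-frame features to obtain a target score map. It is not the paper's Transformer predictor and does not use the pytracking code: the function names, the feature shapes, and the choice of a 1x1 filter are all assumptions made for this example, and the weights are planted by hand rather than predicted.

```python
import numpy as np

def apply_target_model(test_features, weights):
    """Correlate a weight vector with test-frame features.

    test_features: array of shape (C, H, W); hypothetical backbone features.
    weights: array of shape (C,); stands in for predicted target model
             weights, treated here as a 1x1 filter (an assumption).
    Returns an (H, W) score map.
    """
    return np.tensordot(weights, test_features, axes=([0], [0]))

def locate_target(score_map):
    """Return the (row, col) position of the highest score."""
    return np.unravel_index(np.argmax(score_map), score_map.shape)

# Toy data: weak random features with one location that matches the weights.
rng = np.random.default_rng(0)
C, H, W = 8, 5, 5
target_direction = rng.normal(size=C)
target_direction /= np.linalg.norm(target_direction)

test_features = 0.01 * rng.normal(size=(C, H, W))
test_features[:, 3, 1] += target_direction  # plant the "target" at row 3, col 1

score_map = apply_target_model(test_features, weights=target_direction)
row, col = locate_target(score_map)
print(int(row), int(col))
```

In the tracker described above, a second set of predicted weights is applied for bounding box regression; the sketch covers only the localization side.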

Code

github.com/visionml/pytracking

Citation

@inproceedings{mayer2022transforming,
    author    = {Mayer, Christoph and Danelljan, Martin and Bhat, Goutam and Paul, Matthieu and Paudel, Danda Pani and Yu, Fisher and Van Gool, Luc},
    title     = {Transforming Model Prediction for Tracking},
    booktitle = {Computer Vision and Pattern Recognition},
    year      = {2022}
}

Related


Tracking Every Thing in the Wild

ECCV 2022 We introduce a new metric, Track Every Thing Accuracy (TETA), and a Track Every Thing tracker (TETer), which performs association using Class Exemplar Matching (CEM).


Video Mask Transfiner for High-Quality Video Instance Segmentation

ECCV 2022 We propose the Video Mask Transfiner (VMT) method, which leverages fine-grained high-resolution features thanks to a highly efficient video transformer structure.


Video Mask Transfiner for High-Quality Video Instance Segmentation

ECCV 2022 We introduce the HQ-YTVIS dataset as well as Tube-Boundary AP, which together provide training, validation and testing support to facilitate future development of VIS methods aiming at higher mask quality.


SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation

CVPR 2022 We introduce the largest synthetic dataset for autonomous driving to study continuous domain adaptation and multi-task perception.


Monocular Quasi-Dense 3D Object Tracking

TPAMI 2022 We combine quasi-dense tracking on 2D images and motion prediction in 3D space to achieve significant advances in 3D object tracking from monocular videos.


Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation

NeurIPS 2021 Spotlight We propose the Prototypical Cross-Attention Network (PCAN), which leverages rich spatio-temporal information for online multiple object tracking and segmentation.


Quasi-Dense Similarity Learning for Multiple Object Tracking

CVPR 2021 Oral We propose a simple yet effective multi-object tracking method in this paper.


BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning

CVPR 2020 Oral The largest driving video dataset for heterogeneous multitask learning.


Joint Monocular 3D Vehicle Detection and Tracking

ICCV 2019 We propose a novel online framework for 3D vehicle detection and tracking from monocular videos.