Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving

Thomas E. Huang, Yifan Liu, Luc Van Gool, Fisher Yu
ICCV 2023

Abstract

Performing multiple heterogeneous visual tasks in dynamic scenes is a hallmark of human perception. Despite remarkable progress in image and video recognition via representation learning, current research still focuses on designing specialized networks for singular, homogeneous, or simple combinations of tasks. We instead explore the construction of a unified model for the major image and video recognition tasks in autonomous driving, with diverse input and output structures. To enable such an investigation, we design a new challenge, Video Task Decathlon (VTD), which includes ten representative image and video tasks spanning classification, segmentation, localization, and association of objects and pixels. On VTD, we develop VTDNet, a unified network that uses a single structure and a single set of weights for all ten tasks. VTDNet groups similar tasks and employs task interaction stages to exchange information within and between task groups. Because labeling all tasks on all frames is impractical, and jointly training many tasks degrades performance, we design a Curriculum training, Pseudo-labeling, and Fine-tuning (CPF) scheme that successfully trains VTDNet on all tasks and mitigates the performance loss. Armed with CPF, VTDNet significantly outperforms its single-task counterparts on most tasks with only 20% of the overall computation. VTD is a promising new direction for exploring the unification of perception tasks in autonomous driving.
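
For intuition, below is a minimal, hypothetical sketch of the task-grouping idea described above: a single shared backbone feeds grouped task heads, and an interaction stage exchanges information within each group (the between-group exchange is omitted for brevity). Every name in the snippet, including the modules, the choice of self-attention as the interaction mechanism, the task names, and the tensor shapes, is an illustrative assumption, not the actual VTDNet implementation.

import torch
import torch.nn as nn

class TaskGroup(nn.Module):
    # A group of related task heads sharing one interaction stage.
    def __init__(self, dim: int, task_names: list[str]):
        super().__init__()
        # Interaction stage, assumed here to be self-attention over feature tokens.
        self.interact = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.heads = nn.ModuleDict({t: nn.Linear(dim, dim) for t in task_names})

    def forward(self, feat: torch.Tensor) -> dict[str, torch.Tensor]:
        # feat: (batch, tokens, dim) shared features from the backbone.
        feat, _ = self.interact(feat, feat, feat)  # exchange information within the group
        return {t: head(feat) for t, head in self.heads.items()}

class TinyVTDSketch(nn.Module):
    # One backbone and one set of weights serving all task groups.
    def __init__(self, dim: int = 256):
        super().__init__()
        self.backbone = nn.Linear(3, dim)  # stand-in for a real image/video backbone
        self.groups = nn.ModuleList([
            TaskGroup(dim, ["classification", "segmentation"]),
            TaskGroup(dim, ["detection", "tracking"]),
        ])

    def forward(self, pixels: torch.Tensor) -> dict[str, torch.Tensor]:
        feat = self.backbone(pixels)  # (batch, tokens, dim)
        outputs: dict[str, torch.Tensor] = {}
        for group in self.groups:
            outputs.update(group(feat))
        return outputs

model = TinyVTDSketch()
preds = model(torch.randn(2, 16, 3))  # toy input: 2 samples, 16 tokens, 3 channels
print({name: out.shape for name, out in preds.items()})

The CPF scheme from the abstract would sit on top of such a model: train on a curriculum of task subsets, pseudo-label the frames that lack annotations, then fine-tune jointly on all tasks; that training loop is not shown here.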

Paper: paper

Code: github.com/SysCV/vtd (coming soon)

Citation

@inproceedings{huang2023vtd,
    title={Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving},
    author={Huang, Thomas E and Liu, Yifan and Van Gool, Luc and Yu, Fisher},
    booktitle={International Conference on Computer Vision (ICCV)},
    year={2023}
}

Related


Dense Prediction with Attentive Feature Aggregation

WACV 2023 We propose Attentive Feature Aggregation (AFA) to exploit both spatial and channel information for semantic segmentation and boundary detection.


Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation

NeurIPS 2021 Spotlight We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation.


End-to-End Urban Driving by Imitating a Reinforcement Learning Coach

ICCV 2021 We demonstrate that an RL coach (Roach) is a better choice for supervising imitation learning agents.


Instance-Aware Predictive Navigation in Multi-Agent Environments

ICRA 2021 A new visual model-based RL method that considers multiple hypotheses for future object movement.


BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning

CVPR 2020 Oral The largest driving video dataset for heterogeneous multitask learning.


Semantic Predictive Control for Explainable and Efficient Policy Learning

ICRA 2019 We propose a driving policy learning framework that predicts feature representations of future visual inputs.


Deep Object-Centric Policies for Autonomous Driving

ICRA 2019 We show that object-centric models outperform object-agnostic methods in scenes with other vehicles and pedestrians.


End-to-end Learning of Driving Models from Large-scale Video Datasets

CVPR 2017 Oral We develop an end-to-end trainable architecture for learning to predict a distribution over future vehicle egomotion from instantaneous monocular camera observations and previous vehicle state.