CVPR 2023 BDD100K Challenges

We are hosting multi-object tracking (MOT) and segmentation (MOTS) challenges based on BDD100K, the largest open driving video dataset, as part of the CVPR 2023 Workshop on Autonomous Driving (WAD).

Overview

This is a large-scale tracking challenge under the most diverse driving conditions. Understanding the temporal association and shape of objects within videos is a fundamental yet challenging task for autonomous driving. The BDD100K MOT and MOTS datasets provide diverse driving scenarios with high-quality instance segmentation masks under complicated occlusion and reappearance patterns, serving as a great testbed for the reliability of tracking and segmentation algorithms in real scenes. We encourage participants from both academia and industry.

Challenges

  • Multiple Object Tracking (MOT): Given a video sequence of camera images, predict 2D bounding boxes for each object and their association across frames.

  • Multiple Object Tracking and Segmentation (MOTS): In addition to MOT, also predict segmentation masks for each object.

Prizes

The top three ranked teams for each challenge will receive the following prizes:

  1. First place: 1000 CHF, plus the opportunity to present their method at the CVPR 2023 Workshop on Autonomous Driving.

  2. Second place: 500 CHF.

  3. Third place: 300 CHF.

All participants will receive certificates with their ranking, if desired.

Timeline

The challenge starts on March 21, 2023 and ends at 5 PM GMT on June 7, 2023. You can use this tool to convert to your local time.

Data

BDD100K was collected across diverse scenarios, covering New York, the San Francisco Bay Area, and other regions in the US. It contains scenes in a wide variety of locations, weather conditions, and times of day, including highways, city streets, residential areas, and rainy/snowy weather. The BDD100K MOT set contains 2,000 fully annotated 40-second sequences at 5 FPS under different weather conditions, times of day, and scene types. We use 1,400/200/400 videos for train/val/test, containing a total of 160K instances and 4M objects. The MOTS set uses a subset of the MOT videos, with 154/32/37 videos for train/val/test, containing 25K instances and 480K object masks. For all challenges, the full 100K raw video sequences at 30 FPS are also available for training.

Baselines

We provide baselines (QDTrack [1] for MOT and PCAN [3] for MOTS), which serve as examples of how to use the BDD100K data.

Submission

For submission, please follow the format below for each challenge.

MOT Format

To evaluate your algorithm on the BDD100K MOT benchmark, the submission must be in the standard Scalabel format and packaged as one of the following:

  • A zip file of a folder that contains JSON files of each video.
  • A zip file of a JSON file of the entire evaluation set.

The JSON file for each video should contain a list of per-frame result dictionaries with the following structure:

    - videoName: str, name of current sequence
    - name: str, name of current frame
    - frameIndex: int, index of current frame within sequence
    - labels []:
        - id: str, unique instance id of prediction in current sequence
        - category: str, name of the predicted category
        - box2d []:
            - x1: float
            - y1: float
            - x2: float
            - y2: float

You can find an example result file here.
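
For illustration, here is a minimal Python sketch of how such a per-frame entry could be assembled and written to a per-video JSON file. The video name, frame name, instance id, and box coordinates below are placeholders, not values from the dataset:

    import json

    # Minimal sketch of one per-frame result entry in Scalabel format.
    # All concrete values are placeholders.
    frame_result = {
        "videoName": "example-video",               # name of the current sequence
        "name": "example-video-frame-0000001.jpg",  # name of the current frame
        "frameIndex": 0,                            # frame index within the sequence
        "labels": [
            {
                "id": "1",                          # unique instance id in this sequence
                "category": "car",                  # predicted category name
                "box2d": {"x1": 100.0, "y1": 200.0, "x2": 180.0, "y2": 260.0},
            }
        ],
    }

    # One JSON file per video: a list of such per-frame dictionaries.
    with open("example-video.json", "w") as f:
        json.dump([frame_result], f)

The resulting per-video JSON files (or a single JSON file for the entire evaluation set) can then be zipped for submission.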

MOTS Format

To evaluate your algorithm on the BDD100K MOTS benchmark, the submission must be in the standard Scalabel format and packaged as one of the following:

  • A zip file of a folder that contains JSON files of each video.
  • A zip file of a JSON file of the entire evaluation set.

The JSON file for each video should contain a list of per-frame result dictionaries with the following structure:

    - videoName: str, name of current sequence
    - name: str, name of current frame
    - frameIndex: int, index of current frame within sequence
    - labels []:
        - id: str, unique instance id of prediction in current sequence
        - category: str, name of the predicted category
        - rle:
            - counts: str
            - size: (height, width)

You can find an example result file here.
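
As above, here is a minimal sketch of a per-frame MOTS entry, assuming the RLE string follows the COCO run-length encoding convention (e.g., produced with pycocotools); all concrete values are placeholders:

    import json

    import numpy as np
    from pycocotools import mask as mask_utils  # COCO-style RLE (assumption)

    # Dummy binary mask; in practice this is your predicted instance mask
    # at full image resolution (height x width).
    binary_mask = np.zeros((720, 1280), dtype=np.uint8)
    binary_mask[300:360, 500:620] = 1

    # Encode the mask as run-length encoding; counts is returned as bytes.
    rle = mask_utils.encode(np.asfortranarray(binary_mask))

    frame_result = {
        "videoName": "example-video",               # placeholder sequence name
        "name": "example-video-frame-0000001.jpg",  # placeholder frame name
        "frameIndex": 0,
        "labels": [
            {
                "id": "1",
                "category": "car",
                "rle": {
                    "counts": rle["counts"].decode("utf-8"),  # RLE string
                    "size": rle["size"],                      # (height, width)
                },
            }
        ],
    }

    with open("example-video.json", "w") as f:
        json.dump([frame_result], f)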

Evaluation Server

You can submit your predictions to our challenge evaluation servers hosted on EvalAI.

Note that these are separate servers used specifically for the challenges. Submissions to the public MOT and MOTS servers will not be used.

Submission Policy

You can make 3 successful submissions per month (at most 1 per day) to the test set and an unlimited number of submissions to the validation set. You can set the visibility of each submission to public or private. Before the final deadline, please make your final submission public so that it appears on the public leaderboard.

Evaluation

We provide more details regarding evaluation below.

Super-category

In addition to the evaluation over all 8 classes, we also evaluate results for the 3 super-categories specified below. The super-category evaluation results are provided for reference only.

    "HUMAN":   ["pedestrian", "rider"],
    "VEHICLE": ["car", "bus", "truck", "train"],
    "BIKE":    ["motorcycle", "bicycle"]

Ignore Regions

After the bounding box matching process in evaluation, we ignore all false-positive boxes that have >50% overlap with a crowd region (ground-truth boxes with the “Crowd” attribute).

For simplicity, we also ignore regions annotated with the 3 distractor classes (“other person”, “trailer”, and “other vehicle”) using the same strategy as for crowd regions.
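
As an illustrative sketch only (not the official evaluation code, which defines the exact criterion), the check can be thought of as measuring the intersection of an unmatched predicted box with a crowd/distractor region relative to the predicted box's own area:

    def box_area(box):
        """Area of an axis-aligned box (x1, y1, x2, y2)."""
        x1, y1, x2, y2 = box
        return max(0.0, x2 - x1) * max(0.0, y2 - y1)

    def intersection_area(box_a, box_b):
        """Intersection area of two axis-aligned boxes."""
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        return box_area((x1, y1, x2, y2))

    def is_ignored(pred_box, ignore_boxes, thresh=0.5):
        """True if an unmatched prediction overlaps a crowd/distractor region
        by more than `thresh` of its own area and is therefore not counted
        as a false positive."""
        area = box_area(pred_box)
        if area == 0.0:
            return False
        return any(
            intersection_area(pred_box, box) / area > thresh for box in ignore_boxes
        )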

Pre-training

It is fair game to pre-train your network on ImageNet (ImageNet1K or ImageNet22K). For this challenge, we will only rank methods that do not use any external datasets other than ImageNet, so datasets like COCO and Cityscapes are not allowed.

Metrics

We employ mean Track Every Thing Accuracy (mTETA, the mean TETA over the 8 categories) as our primary evaluation metric for ranking. TETA, proposed in [2], better handles tracking evaluation in the long-tailed, multi-class scenarios present in BDD100K sequences. We also report mean Higher Order Tracking Accuracy (mHOTA), mean Multiple Object Tracking Accuracy (mMOTA), and mean ID F1 score (mIDF1), which were previously used as the main metrics. All metrics are detailed below. Note that, unless stated otherwise, the overall performance is measured over all objects without considering the category. For MOTS, we use the same set of metrics as for MOT; the only difference lies in the computation of the distance matrices: in MOT they are computed using box IoU, while in MOTS mask IoU is used.

  • mTETA (%): mean Track Every Thing Accuracy [2] across all 8 categories.

  • mHOTA (%): mean Higher Order Tracking Accuracy [4] across all 8 categories.

  • mMOTA (%): mean Multiple Object Tracking Accuracy [5] across all 8 categories.

  • mIDF1 (%): mean ID F1 score [6] across all 8 categories.

  • mMOTP (%): mean Multiple Object Tracking Precision [5] across all 8 categories.

  • TETA (%): Track Every Thing Accuracy [2]. It disentangles classification from tracking evaluation, allowing a comprehensive assessment of different aspects of a tracker, such as association performance.

  • HOTA (%): Higher Order Tracking Accuracy [4]. It balances the evaluation of detection and association in a single unified metric.

  • MOTA (%): Multiple Object Tracking Accuracy [5]. It measures the errors from false positives, false negatives and identity switches.

  • IDF1 (%): ID F1 score [6]. The ratio of correctly identified detections over the average number of ground truths and detections.

  • MOTP (%): Multiple Object Tracking Precision [5]. It measures the misalignment between ground truths and detections.
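
For intuition, MOTA and IDF1 reduce to simple ratios of matching counts, and their class-averaged variants (mMOTA, mIDF1) are means over the 8 categories. A minimal sketch, assuming the per-class counts come from the matching step of the evaluation toolkit:

    def mota(num_fp, num_fn, num_idsw, num_gt):
        """MOTA [5]: 1 minus the normalized sum of false positives,
        false negatives, and identity switches."""
        return 1.0 - (num_fp + num_fn + num_idsw) / num_gt

    def idf1(num_idtp, num_idfp, num_idfn):
        """IDF1 [6]: correctly identified detections over the average
        number of ground truths and detections."""
        return 2.0 * num_idtp / (2.0 * num_idtp + num_idfp + num_idfn)

    def class_mean(per_class_scores):
        """Class-averaged variant (e.g. mMOTA, mIDF1) over the 8 categories."""
        return sum(per_class_scores) / len(per_class_scores)

    # Made-up counts for a single class:
    print(mota(num_fp=120, num_fn=300, num_idsw=15, num_gt=4000))  # ~0.891
    print(idf1(num_idtp=3500, num_idfp=140, num_idfn=500))         # ~0.916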

Questions

If you have any questions, please go to the BDD100K discussions board.

Organizers

Siyuan Li
Thomas E. Huang
Tobias Fischer
Fisher Yu

Citations

[1] Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., Yu, F.: Quasi-dense similarity learning for multiple object tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 164–173 (2021)

[2] Li, Siyuan, et al. “Tracking Every Thing in the Wild.” Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII. Cham: Springer Nature Switzerland, 2022.

[3] Ke, L., Li, X., Danelljan, M., Tai, Y.W., Tang, C.K., Yu, F.: Prototypical cross-attention networks for multiple object tracking and segmentation. Advances in Neural Information Processing Systems 34 (2021).

[4] Luiten, Jonathon, et al. “HOTA: A higher order metric for evaluating multi-object tracking.” International Journal of Computer Vision 129 (2021): 548–578.

[5] Bernardin, Keni, and Rainer Stiefelhagen. “Evaluating multiple object tracking performance: the CLEAR MOT metrics.” EURASIP Journal on Image and Video Processing 2008 (2008): 1–10.

[6] Ristani, Ergys, et al. “Performance measures and a data set for multi-target, multi-camera tracking.” European Conference on Computer Vision. Springer, Cham, 2016.