CVPR 2022 BDD100K Challenges


We are hosting multi-object tracking (MOT) and segmentation (MOTS) challenges based on BDD100K, the largest open driving video dataset, as part of the CVPR 2022 Workshop on Autonomous Driving (WAD).

Participation

Please first evaluate your results on our eval.ai challenge pages (MOT or MOTS) to get your performance. Then tell us your method details through this submission form before the challenge deadline. Only teams that fill in the form will be considered in the challenge ranking. This page provides more details on our challenges.

Overview

This is a large-scale tracking challenge under highly diverse driving conditions. Understanding the temporal association and shape of objects in videos is a fundamental yet challenging task for autonomous driving. The BDD100K MOT and MOTS datasets provide diverse driving scenarios with high-quality instance segmentation masks under complicated occlusion and reappearance patterns, serving as a great testbed for the reliability of tracking and segmentation algorithms in real scenes. We encourage participants from both academia and industry.

Challenges

  • Multiple Object Tracking (MOT): Given a video sequence of camera images, predict 2D bounding boxes for each object and their association across frames.

  • Multiple Object Tracking and Segmentation (MOTS): In addition to MOT, also predict segmentation masks for each object.

All participants will receive certificates with their ranking, if desired.

Timeline

The challenge starts on March 21, 2022 and ends at 5 PM GMT on June 7, 2022. You can use this tool to convert the deadline to your local time.

Data

The BDD100K MOT set contains 2,000 fully annotated 40-second sequences across different weather conditions, times of day, and scene types. We use 1,400/200/400 videos for train/val/test, containing a total of 160K instances and 4M objects. The MOTS set uses a subset of the MOT videos, with 154/32/37 videos for train/val/test, containing 25K instances and 480K object masks.

Baselines

We provide two baselines, one for each challenge (QDTrack [1] for MOT and PCAN [2] for MOTS), which serve as examples of how to use the BDD100K data.

You can also find the baselines in the BDD100K Model Zoo.

Submission

For submission, please use the following format for each challenge.

MOT Format

To evaluate your algorithm on the BDD100K MOT benchmark, the submission must be in the standard Scalabel format and provided as one of the following:

  • A zip file of a folder that contains one JSON file per video.
  • A zip file of a single JSON file covering the entire evaluation set.

The JSON file for each video should contain a list of per-frame result dictionaries with the following structure:

    - videoName: str, name of current sequence
    - name: str, name of current frame
    - frameIndex: int, index of current frame within sequence
    - labels []:
        - id: str, unique instance id of prediction in current sequence
        - category: str, name of the predicted category
        - box2d []:
            - x1: float
            - y1: float
            - x2: float
            - y2: float

You can find an example result file here.
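
As a minimal Python sketch (not the official toolkit), the snippet below assembles one result file per video in the format above and zips a folder of such files for upload. The helper names and the per-frame tuple layout of the tracker output are hypothetical; only the field names (videoName, name, frameIndex, labels, box2d) follow the format described above.

    import json
    import os
    import zipfile

    def write_video_results(video_name, frames, out_dir):
        """Write one Scalabel-style JSON file for a video.

        frames: list of (frame_name, frame_index, tracks), where tracks is a
        list of (track_id, category, x1, y1, x2, y2) tuples (hypothetical
        tracker output layout).
        """
        results = []
        for frame_name, frame_index, tracks in frames:
            labels = [
                {
                    "id": str(track_id),
                    "category": category,
                    "box2d": {"x1": x1, "y1": y1, "x2": x2, "y2": y2},
                }
                for track_id, category, x1, y1, x2, y2 in tracks
            ]
            results.append(
                {
                    "videoName": video_name,
                    "name": frame_name,
                    "frameIndex": frame_index,
                    "labels": labels,
                }
            )
        with open(os.path.join(out_dir, f"{video_name}.json"), "w") as f:
            json.dump(results, f)

    def zip_folder(folder, zip_path="mot_submission.zip"):
        """Zip all per-video JSON files in a folder for upload."""
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for file_name in os.listdir(folder):
                zf.write(os.path.join(folder, file_name), arcname=file_name)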

MOTS Format

To evaluate your algorithm on the BDD100K MOTS benchmark, the submission must be in the standard Scalabel format and provided as one of the following:

  • A zip file of a folder that contains one JSON file per video.
  • A zip file of a single JSON file covering the entire evaluation set.

The JSON file for each video should contain a list of per-frame result dictionaries with the following structure:

    - videoName: str, name of current sequence
    - name: str, name of current frame
    - frameIndex: int, index of current frame within sequence
    - labels []:
        - id: str, unique instance id of prediction in current sequence
        - category: str, name of the predicted category
        - rle:
            - counts: str
            - size: (height, width)

You can find an example result file here.
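
For the rle fields, one common approach is to encode each binary instance mask with pycocotools; this is an assumption about tooling (the official BDD100K scripts may use their own utilities), and the helper names below are hypothetical. A minimal Python sketch:

    import numpy as np
    from pycocotools import mask as mask_utils

    def mask_to_rle(binary_mask):
        """Encode a binary H x W instance mask into the `rle` field above.

        pycocotools returns `counts` as bytes, so it is decoded to str to keep
        the result JSON-serializable.
        """
        rle = mask_utils.encode(np.asfortranarray(binary_mask.astype(np.uint8)))
        return {
            "counts": rle["counts"].decode("utf-8"),
            "size": list(rle["size"]),  # [height, width]
        }

    def make_mots_label(track_id, category, binary_mask):
        """Build one label entry for a predicted instance in one frame."""
        return {
            "id": str(track_id),
            "category": category,
            "rle": mask_to_rle(binary_mask),
        }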

Evaluation Server

You can submit your predictions to our challenge evaluation servers hosted on EvalAI (the MOT and MOTS challenge pages linked in the Participation section above).

Note that these are separate servers used specifically for the challenges. Submissions to the public MOT and MOTS servers will not be used.

Submission Policy

You can make 3 successful submissions per month (at most 1 per day) to the test set and an unlimited number of submissions to the validation set. The leaderboard will be public.

Evaluation

This section provides more details about the evaluation.

Super-category

In addition to evaluating all 8 classes, we also evaluate results for the 3 super-categories specified below. The super-category results are provided for reference only.

    "HUMAN":   ["pedestrian", "rider"],
    "VEHICLE": ["car", "bus", "truck", "train"],
    "BIKE":    ["motorcycle", "bicycle"]

Ignore Regions

After the bounding box matching process in evaluation, we ignore all false-positive boxes that have more than 50% overlap with a crowd region (a ground-truth box with the “Crowd” attribute).

For simplicity, we also ignore object regions annotated with the 3 distractor classes (“other person”, “trailer”, and “other vehicle”) using the same strategy as for crowd regions.
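
A minimal Python sketch of this ignore rule is given below, assuming overlap is measured as the fraction of the unmatched predicted box covered by an ignore region; the function names are hypothetical and the official evaluation code is the authoritative definition.

    # Boxes are (x1, y1, x2, y2); `ignore_regions` contains crowd and
    # distractor-class ground-truth boxes. The >50% overlap criterion is
    # assumed to be intersection area / predicted-box area.
    def box_area(box):
        x1, y1, x2, y2 = box
        return max(0.0, x2 - x1) * max(0.0, y2 - y1)

    def intersection_area(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        return max(0.0, x2 - x1) * max(0.0, y2 - y1)

    def is_ignored(pred_box, ignore_regions, threshold=0.5):
        """Return True if an unmatched predicted box should be ignored."""
        area = box_area(pred_box)
        if area == 0:
            return False
        return any(
            intersection_area(pred_box, region) / area > threshold
            for region in ignore_regions
        )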

Pre-training

It is fair game to pre-train your network on ImageNet, but if other datasets are used, please note this in the submission description. The ranking will only consider methods that use no external datasets other than ImageNet.

Metrics

We employ the mean Multiple Object Tracking Accuracy (mMOTA, the mean of MOTA over the 8 categories) as our primary evaluation metric for ranking. We also employ the mean ID F1 score (mIDF1) to highlight tracking consistency, which is crucial for object tracking. All metrics are detailed below, and reference formulas for the two primary metrics follow the list. Note that, unless stated otherwise, the overall performance is measured over all objects without considering the category. For MOTS, we use the same set of metrics as for MOT; the only difference lies in the computation of the distance matrices: MOT uses box IoU, while MOTS uses mask IoU.

  • mMOTA (%): mean Multiple Object Tracking Accuracy across all 8 categories.

  • mIDF1 (%): mean ID F1 score across all 8 categories.

  • mMOTP (%): mean Multiple Object Tracking Precision across all 8 categories.

  • MOTA (%): Multiple Object Tracking Accuracy [3]. It measures the errors from false positives, false negatives and identity switches.

  • IDF1 (%): ID F1 score [4]. The ratio of correctly identified detections over the average number of ground-truths and detections.

  • MOTP (%): Multiple Object Tracking Precision [3]. It measures the misalignments between ground-truths and detections.

  • FP: Number of False Positives [3].

  • FN: Number of False Negatives [3].

  • IDSw: Number of Identity Switches [3]. An identity switch is counted when a ground-truth object is matched with an identity that differs from the last known assigned identity.

  • MT: Number of Mostly Tracked identities, i.e., identities tracked for at least 80 percent of their lifespan.

  • PT: Number of Partially Tracked identities, i.e., identities tracked for at least 20 percent but less than 80 percent of their lifespan.

  • ML: Number of Mostly Lost identities, i.e., identities tracked for less than 20 percent of their lifespan.

  • FM: Number of Fragmentations, i.e., the total number of times a trajectory switches from tracked to untracked.
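
For reference, the standard definitions of the two primary metrics [3, 4] can be written as below, where GT is the number of ground-truth boxes and IDTP, IDFP, IDFN are the identity-level true positives, false positives, and false negatives; the official evaluation code remains the authoritative definition.

    \mathrm{MOTA} = 1 - \frac{\mathrm{FN} + \mathrm{FP} + \mathrm{IDSw}}{\mathrm{GT}},
    \qquad
    \mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}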

Questions

If you have any questions, please go to the BDD100K discussions board.

Organizers

Siyuan Li
Thomas E. Huang
Tobias Fischer
Fisher Yu

Citations

[1] Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., Yu, F.: Quasi-dense similarity learning for multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 164–173 (2021)

[2] Ke, L., Li, X., Danelljan, M., Tai, Y.W., Tang, C.K., Yu, F.: Prototypical cross-attention networks for multiple object tracking and segmentation. In: Advances in Neural Information Processing Systems 34 (2021)

[3] Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing 2008, 1–10 (2008)

[4] Ristani, E., et al.: Performance measures and a data set for multi-target, multi-camera tracking. In: European Conference on Computer Vision. Springer, Cham (2016)