SHIFT Dataset

Get Started

Overview


Our dataset is organized in a hierarchical structure, illustrated in the diagram below. The names in square brackets are abbreviations used in code and file names.

Figure: Data structure of the dataset.

Split

A split defines a subset of data for training, validation, or testing. In the testing split, only images are publicly accessible; the annotations are withheld for online evaluation.

View

A view is a specific direction in which a camera points. We provide 6 camera views: Front, Left/Right 45°, Left/Right 90°, and Front (Left). Please note that the Front and Front (Left) views can be paired for stereo vision. In addition, we provide one LiDAR view, named the Center view.

Data group

Within each view, the data are organized into data groups. A data group is the smallest unit for download; each contains one specific type of data or annotations:

  • Pixel-level data group. The data groups for images and pixel-level labels (i.e., semantic masks, depth maps, and optical flow) are zip/tar files that pack this type of data by driving sequence.
  • Object-level data group (marked in light green in the diagram above). The data groups for object labels (i.e., 2D object labels, 3D object labels, and instance labels) are provided as JSON files in the Scalabel format. Please find more information below.

Sequence

A sequence is a 50-second driving record of consecutive frames. The sequence ID is a unique 8-digit hexadecimal number. For pixel-level data groups, each sequence is stored as a folder of frames named after its sequence ID.

Frame

A frame contains the data captured at a specific timestamp in a sequence. It is the smallest unit of granularity in our dataset. Files are named following the pattern <frame_number>_<data_group>_<view>.<ext>. Here, <data_group> and <view> are abbreviations for the corresponding data group and view, and <ext> is the file extension. The file extension and format for each pixel-level data group can be found in the following table.

Pixel-level data group | Abbreviation | File extension | File format
Image                  | img          | jpg            | 24-bit RGB.
Semantic Segmentation  | semseg       | png            | 8-bit grayscale, pixel-level semantic masks.
Depth Maps             | depth        | png            | 24-bit RGB, pixel-level depth maps. Depth (meters) = (256 * 256 * B + 256 * G + R) / (256 * 256 * 256 - 1) * 1000.
Optical Flow           | flow         | npz            | [800, 1280, 2] NumPy array in UV-map format.
Table 1: Specification for files in pixel-level data groups.
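
As a reference, here is a minimal sketch of decoding a depth map into meters following the formula in Table 1. It assumes an RGB-ordered PNG read via Pillow; the file path and function name are placeholders.

import numpy as np
from PIL import Image

def decode_depth(depth_png_path):
    # Read the 24-bit RGB depth map; Pillow yields channels in R, G, B order.
    rgb = np.asarray(Image.open(depth_png_path), dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # Depth (meters) = (256*256*B + 256*G + R) / (256^3 - 1) * 1000, as in Table 1.
    return (256 * 256 * b + 256 * g + r) / (256 ** 3 - 1) * 1000.0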

Data conversion

By using the following commands, you can convert the downloaded data into other formats. Please refer to the SHIFT DevKit for more information.

Data conversion details

Decompress videos to frames

For easier retrieval of frames during training, we recommend decompressing all video sequences into frames before training. Make sure there is enough disk space to store the decompressed frames.

  • To ensure reproducible decompression of the videos, we recommend using our Docker image: docker run -v <dataset_path>:/data shift-devkit:latest decompress_video.sh. The Docker image contains the specific version of FFmpeg used for video processing.
  • To use your local FFmpeg libraries (not recommended), please run python -m shift_dev.io.decompress_videos <dataset_path>.

Please keep in mind that using your local FFmpeg package may produce frames that are inconsistent with the outputs of our Docker image, which can make your results not comparable to other published results. For this reason, we strongly recommend using our Docker image.

Pack data groups into HDF5 files

HDF5 files are commonly used to reduce the number of files on disk. To pack one data group into an HDF5 file, please use python -m shift_dev.io.to_hdf5 <path_data_group> <output_path_of_hdf5>.
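
After packing, the resulting file can be inspected with h5py as sketched below. The file name is a placeholder, and the exact group/dataset layout is determined by the devkit, so treat the printed keys as the reference.

import h5py

# Placeholder path; point this at the HDF5 file produced by shift_dev.io.to_hdf5.
with h5py.File("img_front.hdf5", "r") as f:
    # Print every group/dataset key to discover how sequences and frames are laid out.
    f.visit(print)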

Sensors


Camera intrinsics

All RGB cameras have the same intrinsic parameters. Our RGB cameras have a resolution of 1280 × 800 pixels and a field of view (FoV) of 90°. Their intrinsic parameters are (center_x, center_y) = (640, 400), (focal_x, focal_y) = (640, 640), skew = 0.

The baseline of the stereo pair, i.e., the horizontal gap between the Front and Front (Left) cameras, is 0.2 m.
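
These parameters correspond to the pinhole intrinsic matrix sketched below. The disparity-to-depth helper applies the standard stereo relation depth = focal_x * baseline / disparity with the 0.2 m baseline; it is a generic sketch, not a SHIFT-specific API.

import numpy as np

# Pinhole intrinsic matrix built from the parameters above.
K = np.array([
    [640.0,   0.0, 640.0],   # focal_x, skew,    center_x
    [  0.0, 640.0, 400.0],   #          focal_y, center_y
    [  0.0,   0.0,   1.0],
])

def disparity_to_depth(disparity_px):
    # Standard stereo relation depth = focal_x * baseline / disparity,
    # using the 0.2 m baseline of the Front / Front (Left) pair.
    return 640.0 * 0.2 / np.clip(disparity_px, 1e-6, None)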

Poses of views to the ego vehicle

The relative poses of the camera views with respect to the ego vehicle are defined in sensors.yaml, where location denotes the offset along the x, y, and z axes from the ego vehicle’s origin (in meters) and rotation denotes the rotation as Euler angles (in degrees). Please refer to the figure below.
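
A minimal sketch of turning one such pose into a 4×4 sensor-to-ego transform is given below. The XYZ Euler order is an assumption (chosen to match the Scalabel extrinsics described later), and the numeric values are hypothetical placeholders; read the real ones from sensors.yaml.

import numpy as np
from scipy.spatial.transform import Rotation

def pose_to_matrix(location_m, rotation_deg):
    # Build a 4x4 homogeneous sensor-to-ego transform from a sensors.yaml entry.
    # XYZ Euler order is assumed here.
    T = np.eye(4)
    T[:3, :3] = Rotation.from_euler("xyz", rotation_deg, degrees=True).as_matrix()
    T[:3, 3] = location_m
    return T

# Hypothetical values for illustration; read the real ones from sensors.yaml.
T_front_to_ego = pose_to_matrix(location_m=[0.5, 0.0, 1.6], rotation_deg=[0.0, 0.0, 0.0])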

Annotations


Scalabel format

We provide labels for 2D/3D object detection, 2D/3D multiple object tracking (MOT), and instance segmentation using the Scalabel format. Below is the format description.

Format description
Each file has the following fields.
Dataset:
- frames[ ]:                           // stores all frames from every sequence
    - frame0001: Frame
    - frame0002: Frame
    ...
- config:
    - image_size:                       // all images have the same size
        - width: 1280
        - height: 800
    - categories: ["car", "truck", ...] // define the categories of objects

Here, frames[ ] is a list of Frame objects, containing all frames from every sequence. The Frame object is defined as

Frame:
- name: string                          // e.g., "abcd-1234/00000001_img_center.png"
- videoName: string                     // e.g., "abcd-1234", unique across the whole dataset
- attributes                            // for discrete domain shifts        
    - timeofday_coarse: 
        {"daytime" | "dawn/dusk" | "night"} 
    - weather_coarse: 
        {"clear" | "cloudy" | "overcast" | "rainy" | "foggy }
    - timeofday_fine: 
        {"noon" | "daytime" | "morning/afternoon" | "dawn/dusk" | "sunrise/sunset" | 
        "night" | "dark night"} 
    - weather_fine:
        {"clear" | "slight cloudy" | "partial cloudy" | "overcast" | "small rain" | 
         "mid rain" | "heavy rain" | "small fog" | "heavy fog" } 
    - town:                             // name of the town
        {"01" | "02" | "03" | "04" | "05" | "06" | "07" | "10HD"}                              
    - sun_altitude_angle                // [-5, 90] degrees
    - cloudiness                        // [0, 100] percentage
    - precipitation                     // [0, 100] percentage
    - precipitation_deposits            // [0, 100] percentage
    - wind_intensity                    // [0, 100] percentage
    - sun_azimuth_angle                 // [0, 360] degrees
    - fog_density                       // [0, 100] percentage
    - fog_distance                      // meters
    - wetness                           // [0, 100] percentage
    - fog_falloff                       // meters
- intrinsics                            // intrinsic parameters (only for cameras)
    - focal: [x, y]                     // in pixels
    - center: [x, y]                    // in pixels
- extrinsics:                           // extrinsic parameters
    - location: [x, y, z]               // in meters
    - rotation: [rot_x, rot_y, rot_z]   // in radians (XYZ Euler angles)
- timestamp: int                        // time in this video, ms
- frameIndex: int                       // frame index in this video
- size:
    - width: 1280
    - height: 800
- labels [ ]:
    - id: string                         // for tracking, unique in current sequence
    - index: int
    - category: string                   // classification
    - attributes
        - ego: bool                      // if the 3D bounding box is the ego vehicle (only for LiDAR)
    - box2d:                             // 2D bounding box (in pixel)
        - x1: float
        - y1: float
        - x2: float
        - y2: float
    - box3d:                              // 3D bounding box (in ego coordinate system)
        - orientation: [rot_x, rot_y, rot_z]    // Euler angles, in radians
        - location: [x, y, z]                   // in meters
        - dimension: [height, width, length]    // in meters
    - rle:                                // instance mask in RLE format, for instance segmentation
        - counts: [int]
        - size: (height, width)          
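
As an illustration, here is a minimal sketch of loading one object-level data group and iterating over its 2D boxes, using only the fields listed above; the file name is a placeholder.

import json

# Placeholder path; use any downloaded object-level data group in Scalabel format.
with open("det_2d_front.json") as f:
    dataset = json.load(f)

print(dataset["config"]["categories"])   # object categories defined for this data group

for frame in dataset["frames"]:
    for label in frame.get("labels") or []:
        box = label.get("box2d")
        if box is None:
            continue  # skip labels without a 2D box (e.g., 3D-only labels)
        print(frame["videoName"], frame["name"], label["category"],
              box["x1"], box["y1"], box["x2"], box["y2"])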

Sequence information

The sequence information files summarize the environment attributes of each sequence in tabular (.csv) format. The column definitions are given below.

Table of column definitions for sequence info files
#  | Column name            | Definition
0  | video                  | Name of the video sequence (e.g., 0003-17fb).
1  | view                   | Name of the view, {"front" | "left_stereo" | "left_90" | "left_45" | "right_90" | "right_45"}.
2  | town                   | Name of the town, {"01" | "02" | "03" | "04" | "05" | "06" | "07" | "10HD"}.
3  | start_<env_attributes> | Values of the environment attributes at the start of the domain shift.
4  | end_<env_attributes>   | Values of the environment attributes at the end of the domain shift. (Only available for continuous shift sequences.)
5  | shift_type             | Type of continuous domain shift, {"daytime_to_night" | "clear_to_rainy" | "clear_to_foggy"}. (Only available for continuous shift sequences.)
6  | shift_length           | Length of the continuous domain shift, in frames. (Only available for continuous shift sequences.)
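
For instance, the sequence information can be filtered with pandas as sketched below; the CSV path is a placeholder, and the column names follow the table above.

import pandas as pd

# Placeholder path; point this at a downloaded sequence info CSV.
seq_info = pd.read_csv("seq_info_front.csv")

# Select continuous-shift sequences that transition from clear to foggy weather.
foggy = seq_info[seq_info["shift_type"] == "clear_to_foggy"]
print(foggy[["video", "town", "shift_length"]].head())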

Segmentation labels

We provide semantic segmentation labels following CARLA’s semantic labels. Below is a table of the 23 classes in the ground truth and their correspondence to Cityscapes classes, which can be used in semantic segmentation experiments. We highly recommend that users follow the Cityscapes evaluation protocol, where certain classes must be ignored (defined in the last column of the table).

Table of segmentation labels
ID | Name          | Color           | Cityscapes equivalent | Cityscapes ignore_in_eval
0  | unlabeled     | (0, 0, 0)       | 0  | true
1  | building      | (70, 70, 70)    | 11 | false
2  | fence         | (100, 40, 40)   | 13 | false
3  | other         | (55, 90, 80)    | 0  | true
4  | pedestrian    | (220, 20, 60)   | 24 | false
5  | pole          | (153, 153, 153) | 17 | false
6  | road line     | (157, 234, 50)  | 7  | false
7  | road          | (128, 64, 128)  | 7  | false
8  | sidewalk      | (244, 35, 232)  | 8  | false
9  | vegetation    | (107, 142, 35)  | 21 | false
10 | vehicle       | (0, 0, 142)     | 26 | false
11 | wall          | (102, 102, 156) | 12 | false
12 | traffic sign  | (220, 220, 0)   | 20 | false
13 | sky           | (70, 130, 180)  | 23 | false
14 | ground        | (81, 0, 81)     | 6  | true
15 | bridge        | (150, 100, 100) | 15 | true
16 | rail track    | (230, 150, 140) | 10 | true
17 | guard rail    | (180, 165, 180) | 14 | true
18 | traffic light | (250, 170, 30)  | 19 | false
19 | static        | (110, 190, 160) | 4  | true
20 | dynamic       | (170, 120, 50)  | 5  | true
21 | water         | (45, 60, 150)   | 0  | true
22 | terrain       | (145, 170, 100) | 22 | false
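
Based on the table above, here is a minimal sketch of remapping a SHIFT semantic mask to Cityscapes label IDs, with the classes that Cityscapes ignores in evaluation collected for convenience.

import numpy as np

# SHIFT ID -> Cityscapes ID, taken row by row from the table above.
SHIFT_TO_CITYSCAPES = np.array(
    [0, 11, 13, 0, 24, 17, 7, 7, 8, 21, 26, 12, 20, 23, 6, 15, 10, 14, 19, 4, 5, 0, 22],
    dtype=np.uint8,
)

# SHIFT IDs whose Cityscapes counterpart is ignored during evaluation (ignore_in_eval = true).
IGNORE_IN_EVAL = {0, 3, 14, 15, 16, 17, 19, 20, 21}

def to_cityscapes_ids(semseg_mask):
    # Remap an 8-bit SHIFT semantic mask (H, W) to Cityscapes label IDs.
    return SHIFT_TO_CITYSCAPES[semseg_mask]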

LiDAR point clouds

LiDAR point clouds are stored in binary (.ply) format, where each point has 4 properties (x, y, z, intensity) in metric space. A Gaussian-based noise model, as used in CARLA, was applied to the point clouds along the ray direction.

We would like to bring to users’ attention that the provided point clouds are NOT sensitive to domain shift (e.g., rain, fog, …), since the CARLA LiDAR simulation is not affected by different environmental conditions. As such, the point clouds should not be used for domain adaptation studies of LiDAR-specific behaviors. Reasonable applications include 3D detection/tracking within the same domain, or usage as ground truth.
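
A minimal sketch of loading one point cloud with the plyfile package is given below; the file path is a placeholder, and the 'vertex' element name follows the usual PLY convention (an assumption here).

import numpy as np
from plyfile import PlyData

def load_point_cloud(ply_path):
    # Load a LiDAR frame as an (N, 4) array of (x, y, z, intensity).
    vertex = PlyData.read(ply_path)["vertex"]
    return np.stack(
        [vertex["x"], vertex["y"], vertex["z"], vertex["intensity"]], axis=-1
    )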

Coordinate systems and extrinsics

We follow KITTI’s convention for image- and LiDAR-based perception tasks. To see how to use extrinsics, please refer to the script sensor_pose.py.

Details of coordinate systems
Name              | Coordinate system                  | Origin                     | Axes direction             | Note
3D bounding boxes | Camera views (front, left_45, ...) | Center of the respective camera | x-Right, y-Down, z-Forward | The inverse extrinsic can project bounding boxes into the World system.
Camera extrinsics | World                              | Center of the map          | x-North, y-East, z-Up      | The pose of the camera in the World system (degrees).
Table 2: Coordinate systems used for image-based data and annotations.
Name                            | Coordinate system | Origin                     | Axes direction           | Note
Point clouds, 3D bounding boxes | Center view       | Center of the LiDAR sensor | x-Forward, y-Right, z-Up | The inverse extrinsic can project bounding boxes into the World system.
LiDAR extrinsics                | World             | Center of the map          | x-North, y-East, z-Up    | The pose of the LiDAR sensor in the World system (radians).
Table 3: Coordinate systems used for LiDAR-based data and annotations.

Visualization


Below is a visualization of point cloud tracking using the Scalabel web tools.

How to use SHIFT?


Our dataset aims to foster research in several under-explored fields of autonomous driving that are crucial to safety. We enumerate some possible use cases of our dataset:

  • Robustness and generality: Investigating how a perception system’s performance degrades under increasing levels of domain shift.
  • Uncertainty estimation and calibration: Assessing and developing uncertainty estimation methods that work under realistic domain shifts.
  • Multi-task perception system: Studying the combination of tasks and developing multi-task models to effectively counteract domain shift.
  • Continual learning: Investigating how to utilize domain shifts incrementally, e.g., continual domain adaptation and curriculum learning.
  • Test-time learning: Developing and evaluating learning algorithms for continuously changing environments, e.g., test-time adaptation.