SHIFT Dataset

Get Started

Overview


Our dataset is organized in a hierarchical structure, illustrated in the diagram below. The names in square brackets are abbreviations used in code and file names.

Figure: Data structure of the dataset.

Split

A split defines a subset of data for training, validation, or testing. In the testing split, only images are publicly accessible; the annotations are withheld for online evaluation.

View

A view is a specific direction in which a camera points. We provide 6 camera views: Front, Left/Right 45°, Left/Right 90°, and Front (Left). Please note that the Front and Front (Left) views can be paired for stereo vision. In addition, we provide one LiDAR view, named the Center view.

Data group

Within each view, the data are organized into data groups. A data group is the smallest unit for download; each contains one specific type of data or annotations:

  • Pixel-level data group. The data groups for images and pixel-level labels (i.e., semantic masks, depth maps, and optical flow) are zip/tar files that pack this type of data by driving sequence.
  • Object-level data group (marked in light green in the diagram above). The data groups for object labels (i.e., 2D object labels, 3D object labels, and instance labels) are provided as JSON files in the Scalabel format. Please find more information below.

Sequence

A sequence is a 50-second driving record of consecutive frames. The sequence ID is a unique 8-digit hexadecimal number. For pixel-level data groups, each sequence is stored as a folder of frames named after its sequence ID.

Frame

A frame contains the data captured at a specific timestamp in a sequence. It is the smallest unit of granularity in our dataset. Files are named following the pattern <frame_number>_<data_group>_<view>.<ext>. Here, <data_group> and <view> are abbreviations for the corresponding data group and view, and <ext> is the file extension. The file extension and format for each pixel-level data group can be found in the following table.

Pixel-level data group | Abbreviation | File extension | File format
Image                  | img          | jpg            | 24-bit RGB.
Semantic Segmentation  | semseg       | png            | 8-bit grayscale, pixel-level semantic masks.
Depth Maps             | depth        | png            | 24-bit RGB, pixel-level depth maps. Depth (meters) = (256 * 256 * B + 256 * G + R) / (256 * 256 * 256 - 1) * 1000.
Optical Flow           | flow         | npz            | [800, 1280, 2] NumPy array in UV-map format.
Table 1: Specification for files in pixel-level data groups.
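
As a reference, here is a minimal sketch of decoding a depth map into meters following the formula in Table 1. It assumes an RGB-ordered PNG read via Pillow; the file path and function name are placeholders.

import numpy as np
from PIL import Image

def decode_depth(depth_png_path):
    # Read the 24-bit RGB depth map; Pillow yields channels in R, G, B order.
    rgb = np.asarray(Image.open(depth_png_path), dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # Depth (meters) = (256*256*B + 256*G + R) / (256^3 - 1) * 1000, as in Table 1.
    return (256 * 256 * b + 256 * g + r) / (256 ** 3 - 1) * 1000.0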

Data conversion

By using the following commands, you can convert the downloaded data into other formats. Please refer to the SHIFT DevKit for more information.

Data conversion details

Decompress videos to frames

For easier retrieval of frames during training, we recommend decompressing all video sequences into frames before training. Make sure there is enough disk space to store the decompressed frames.

  • To ensure reproducible decompression of the videos, we recommend using our Docker image: docker run -v <dataset_path>:/data shift-devkit:latest decompress_video.sh. The Docker image contains the specific version of FFmpeg used for video processing.
  • To use your local FFmpeg libraries (not recommended), please run python -m shift_dev.io.decompress_videos <dataset_path>.

Please keep in mind that using your local FFmpeg package may produce frames that are inconsistent with the outputs of our Docker image, which can make your results not comparable to other published results. For this reason, we strongly recommend using our Docker image.

Pack data groups into HDF5 files

HDF5 files are commonly used to reduce the number of files on disk. To pack one data group into an HDF5 file, please use python -m shift_dev.io.to_hdf5 <path_data_group> <output_path_of_hdf5>.
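
After packing, the resulting file can be inspected with h5py as sketched below. The file name is a placeholder, and the exact group/dataset layout is determined by the devkit, so treat the printed keys as the reference.

import h5py

# Placeholder path; point this at the HDF5 file produced by shift_dev.io.to_hdf5.
with h5py.File("img_front.hdf5", "r") as f:
    # Print every group/dataset key to discover how sequences and frames are laid out.
    f.visit(print)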

Sensors


Camera intrinsics

All RGB cameras have the same intrinsic parameters. Our RGB cameras have a resolution of 1280 × 800 pixels and a field of view (FoV) of 90°. Their intrinsic parameters are (center_x, center_y) = (640, 400), (focal_x, focal_y) = (640, 640), skew = 0.

The baseline of the stereo pair, i.e., the horizontal gap between the Front and Front (Left) cameras, is 0.2 m.
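
These parameters correspond to the pinhole intrinsic matrix sketched below. The disparity-to-depth helper applies the standard stereo relation depth = focal_x * baseline / disparity with the 0.2 m baseline; it is a generic sketch, not a SHIFT-specific API.

import numpy as np

# Pinhole intrinsic matrix built from the parameters above.
K = np.array([
    [640.0,   0.0, 640.0],   # focal_x, skew,    center_x
    [  0.0, 640.0, 400.0],   #          focal_y, center_y
    [  0.0,   0.0,   1.0],
])

def disparity_to_depth(disparity_px):
    # Standard stereo relation depth = focal_x * baseline / disparity,
    # using the 0.2 m baseline of the Front / Front (Left) pair.
    return 640.0 * 0.2 / np.clip(disparity_px, 1e-6, None)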

Poses of views to the ego vehicle

The relative poses of the camera views with respect to the ego vehicle are defined in sensors.yaml, where location denotes the offset along the x, y, and z axes from the ego vehicle’s origin (in meters) and rotation denotes the rotation as Euler angles (in degrees). Please refer to the figure below.
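
A minimal sketch of turning one such pose into a 4×4 sensor-to-ego transform is given below. The XYZ Euler order is an assumption (chosen to match the Scalabel extrinsics described later), and the numeric values are hypothetical placeholders; read the real ones from sensors.yaml.

import numpy as np
from scipy.spatial.transform import Rotation

def pose_to_matrix(location_m, rotation_deg):
    # Build a 4x4 homogeneous sensor-to-ego transform from a sensors.yaml entry.
    # XYZ Euler order is assumed here.
    T = np.eye(4)
    T[:3, :3] = Rotation.from_euler("xyz", rotation_deg, degrees=True).as_matrix()
    T[:3, 3] = location_m
    return T

# Hypothetical values for illustration; read the real ones from sensors.yaml.
T_front_to_ego = pose_to_matrix(location_m=[0.5, 0.0, 1.6], rotation_deg=[0.0, 0.0, 0.0])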

Annotations


Scalabel format

We provide labels for 2D/3D object detection, 2D/3D multiple object tracking (MOT), and instance segmentation using the Scalabel format. Below is the format description.

Format description
Each file has the following fields.
Dataset:
- frames[ ]:                           // stores all frames from every sequence
    - frame0001: Frame
    - frame0002: Frame
    ...
- config:
    - image_size:                       // all images have the same size
        - width: 1280
        - height: 800
    - categories: ["car", "truck", ...] // define the categories of objects

Here, frames[ ] is a list of Frame objects, containing all frames from every sequence. The Frame object is defined as

Frame:
- name: string                          // e.g., "abcd-1234/00000001_img_center.png"
- videoName: string                     // e.g., "abcd-1234", unique across the whole dataset
- attributes                            // for discrete domain shifts        
    - timeofday_coarse: 
        {"daytime" | "dawn/dusk" | "night"} 
    - weather_coarse: 
        {"clear" | "cloudy" | "overcast" | "rainy" | "foggy }
    - timeofday_fine: 
        {"noon" | "daytime" | "morning/afternoon" | "dawn/dusk" | "sunrise/sunset" | 
        "night" | "dark night"} 
    - weather_fine:
        {"clear" | "slight cloudy" | "partial cloudy" | "overcast" | "small rain" | 
         "mid rain" | "heavy rain" | "small fog" | "heavy fog" } 
    - town:                             // name of the town
        {"01" | "02" | "03" | "04" | "05" | "06" | "07" | "10HD"}                              
    - sun_altitude_angle                // [-5, 90] degrees
    - cloudiness                        // [0, 100] percentage
    - precipitation                     // [0, 100] percentage
    - precipitation_deposits            // [0, 100] percentage
    - wind_intensity                    // [0, 100] percentage
    - sun_azimuth_angle                 // [0, 360] degrees
    - fog_density                       // [0, 100] percentage
    - fog_distance                      // meters
    - wetness                           // [0, 100] percentage
    - fog_falloff                       // meters
- intrinsics                            // intrinsic parameters (only for cameras)
    - focal: [x, y]                     // in pixels
    - center: [x, y]                    // in pixels
- extrinsics:                           // extrinsic parameters
    - location: [x, y, z]               // in meters
    - rotation: [rot_x, rot_y, rot_z]   // in radians (XYZ Euler angles)
- timestamp: int                        // time in this video, ms
- frameIndex: int                       // frame index in this video
- size:
    - width: 1280
    - height: 800
- labels [ ]:
    - id: string                         // for tracking, unique in current sequence
    - index: int
    - category: string                   // classification
    - attributes
        - ego: bool                      // if the 3D bounding box is the ego vehicle (only for LiDAR)
    - box2d:                             // 2D bounding box (in pixel)
        - x1: float
        - y1: float
        - x2: float
        - y2: float
    - box3d:                              // 3D bounding box (in ego coordinate system)
        - orientation: [rot_x, rot_y, rot_z]    // Euler angles, in radians
        - location: [x, y, z]                   // in meters
        - dimension: [height, width, length]    // in meters
    - rle:                                // instance mask in RLE format, for instance segmentation
        - counts: [int]
        - size: (height, width)          
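
As an illustration, here is a minimal sketch of loading one object-level data group and iterating over its 2D boxes, using only the fields listed above; the file name is a placeholder.

import json

# Placeholder path; use any downloaded object-level data group in Scalabel format.
with open("det_2d_front.json") as f:
    dataset = json.load(f)

print(dataset["config"]["categories"])   # object categories defined for this data group

for frame in dataset["frames"]:
    for label in frame.get("labels") or []:
        box = label.get("box2d")
        if box is None:
            continue  # skip labels without a 2D box (e.g., 3D-only labels)
        print(frame["videoName"], frame["name"], label["category"],
              box["x1"], box["y1"], box["x2"], box["y2"])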

Sequence information

The sequence information files summarize the environment attributes of each sequence in tabular (.csv) format. The column definitions are given below.

Table of column definitions for sequence info files
#  | Column name            | Definition
0  | video                  | Name of the video sequence (e.g., 0003-17fb).
1  | view                   | Name of the view, {"front" | "left_stereo" | "left_90" | "left_45" | "right_90" | "right_45"}.
2  | town                   | Name of the town, {"01" | "02" | "03" | "04" | "05" | "06" | "07" | "10HD"}.
3  | start_<env_attributes> | Values of the environment attributes at the start of the domain shift.
4  | end_<env_attributes>   | Values of the environment attributes at the end of the domain shift. (Only available for continuous shift sequences.)
5  | shift_type             | Type of continuous domain shift, {"daytime_to_night" | "clear_to_rainy" | "clear_to_foggy"}. (Only available for continuous shift sequences.)
6  | shift_length           | Length of the continuous domain shift, in frames. (Only available for continuous shift sequences.)
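
For instance, the sequence information can be filtered with pandas as sketched below; the CSV path is a placeholder, and the column names follow the table above.

import pandas as pd

# Placeholder path; point this at a downloaded sequence info CSV.
seq_info = pd.read_csv("seq_info_front.csv")

# Select continuous-shift sequences that transition from clear to foggy weather.
foggy = seq_info[seq_info["shift_type"] == "clear_to_foggy"]
print(foggy[["video", "town", "shift_length"]].head())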

Segmentation labels

We provide semantic segmentation labels following CARLA’s semantic labels. Below is a table of the 23 classes in the ground truth and their correspondence to Cityscapes classes, which can be used in semantic segmentation experiments. We highly recommend that users follow the Cityscapes evaluation protocol, where certain classes must be ignored (defined in the last column of the table).

Table of segmentation labels
ID | Name          | Color           | Cityscapes equivalent | Cityscapes ignore_in_eval
0  | unlabeled     | (0, 0, 0)       | 0  | true
1  | building      | (70, 70, 70)    | 11 | false
2  | fence         | (100, 40, 40)   | 13 | false
3  | other         | (55, 90, 80)    | 0  | true
4  | pedestrian    | (220, 20, 60)   | 24 | false
5  | pole          | (153, 153, 153) | 17 | false
6  | road line     | (157, 234, 50)  | 7  | false
7  | road          | (128, 64, 128)  | 7  | false
8  | sidewalk      | (244, 35, 232)  | 8  | false
9  | vegetation    | (107, 142, 35)  | 21 | false
10 | vehicle       | (0, 0, 142)     | 26 | false
11 | wall          | (102, 102, 156) | 12 | false
12 | traffic sign  | (220, 220, 0)   | 20 | false
13 | sky           | (70, 130, 180)  | 23 | false
14 | ground        | (81, 0, 81)     | 6  | true
15 | bridge        | (150, 100, 100) | 15 | true
16 | rail track    | (230, 150, 140) | 10 | true
17 | guard rail    | (180, 165, 180) | 14 | true
18 | traffic light | (250, 170, 30)  | 19 | false
19 | static        | (110, 190, 160) | 4  | true
20 | dynamic       | (170, 120, 50)  | 5  | true
21 | water         | (45, 60, 150)   | 0  | true
22 | terrain       | (145, 170, 100) | 22 | false
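
Based on the table above, here is a minimal sketch of remapping a SHIFT semantic mask to Cityscapes label IDs, with the classes that Cityscapes ignores in evaluation collected for convenience.

import numpy as np

# SHIFT ID -> Cityscapes ID, taken row by row from the table above.
SHIFT_TO_CITYSCAPES = np.array(
    [0, 11, 13, 0, 24, 17, 7, 7, 8, 21, 26, 12, 20, 23, 6, 15, 10, 14, 19, 4, 5, 0, 22],
    dtype=np.uint8,
)

# SHIFT IDs whose Cityscapes counterpart is ignored during evaluation (ignore_in_eval = true).
IGNORE_IN_EVAL = {0, 3, 14, 15, 16, 17, 19, 20, 21}

def to_cityscapes_ids(semseg_mask):
    # Remap an 8-bit SHIFT semantic mask (H, W) to Cityscapes label IDs.
    return SHIFT_TO_CITYSCAPES[semseg_mask]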

LiDAR point clouds

LiDAR point clouds are stored in binary (.ply) format, where each point has 4 properties (x, y, z, intensity) in metric space. A Gaussian-based noise model, as used in CARLA, was applied to the point clouds along the ray direction.

We would like to bring to users’ attention that the provided point clouds are NOT sensitive to domain shift (e.g., rain, fog, …), since the CARLA LiDAR simulation is not affected by different environmental conditions. As such, the point clouds should not be used for domain adaptation studies of LiDAR-specific behaviors. Reasonable applications include 3D detection/tracking within the same domain, or usage as ground truth.
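
A minimal sketch of loading one point cloud with the plyfile package is given below; the file path is a placeholder, and the 'vertex' element name follows the usual PLY convention (an assumption here).

import numpy as np
from plyfile import PlyData

def load_point_cloud(ply_path):
    # Load a LiDAR frame as an (N, 4) array of (x, y, z, intensity).
    vertex = PlyData.read(ply_path)["vertex"]
    return np.stack(
        [vertex["x"], vertex["y"], vertex["z"], vertex["intensity"]], axis=-1
    )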

Coordinate systems and extrinsics

We follow KITTI’s convention for image- and LiDAR-based perception tasks. To see how to use extrinsics, please refer to the script sensor_pose.py.

Details of coordinate systems
Name              | Coordinate system                  | Origin                     | Axes direction             | Note
3D bounding boxes | Camera views (front, left_45, ...) | Center of the respective camera | x-Right, y-Down, z-Forward | The inverse extrinsic can project bounding boxes into the World system.
Camera extrinsics | World                              | Center of the map          | x-North, y-East, z-Up      | The pose of the camera in the World system (degrees).
Table 2: Coordinate systems used for image-based data and annotations.
Name                            | Coordinate system | Origin                     | Axes direction           | Note
Point clouds, 3D bounding boxes | Center view       | Center of the LiDAR sensor | x-Forward, y-Right, z-Up | The inverse extrinsic can project bounding boxes into the World system.
LiDAR extrinsics                | World             | Center of the map          | x-North, y-East, z-Up    | The pose of the LiDAR sensor in the World system (radians).
Table 3: Coordinate systems used for LiDAR-based data and annotations.

Visualization


Below is a visualization of point cloud tracking using the Scalabel web tools.

How to use SHIFT?


Our dataset aims to foster research in several under-explored fields of autonomous driving that are crucial to safety. We enumerate some possible use cases of our dataset:

  • Robustness and generality: Investigating how a perception system’s performance degrades under increasing levels of domain shift.
  • Uncertainty estimation and calibration: Assessing and developing uncertainty estimation methods that work under realistic domain shifts.
  • Multi-task perception system: Studying the combination of tasks and developing multi-task models to effectively counteract domain shift.
  • Continual learning: Investigating how to utilize domain shifts incrementally, e.g., continual domain adaptation and curriculum learning.
  • Test-time learning: Developing and evaluating learning algorithms for continuously changing environments, e.g., test-time adaptation.