This dataset was created using LeRobot.
IMPORTANT: Standard LeRobot (huggingface/lerobot) does not currently support depth videos or point clouds. To load this dataset, you must use the modified codebase: ZibinDong/lerobotdataset3d, which adds full decoding support for depth videos (H.265/H.264/FFV1) and quantized point clouds.
Dataset Description
droid_3d is a large-scale robot manipulation dataset collected with the DROID data collection platform. It contains multi-view RGB videos, depth videos, point clouds, robot actions, and natural language task descriptions. The dataset is designed for training vision-language-action models and 3D-aware robot policies.
- Homepage: https://github.com/ZibinDong/lerobotdataset3d
- Paper: EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation
- License: MIT
Dataset Summary
The dataset comprises 58,201 episodes with a total of 18,083,626 frames, recorded at 15 FPS with a three-camera setup (one wrist camera and two external cameras). It provides:
- RGB videos from 3 camera views (wrist, external_0, external_1) at 224×398 (height × width) resolution, encoded with AV1.
- Depth videos from the same 3 views at 224×398 resolution, encoded with H.265 (h265_uint12), with a depth range of 2000 mm.
- Point clouds (at most 2048 points) for each camera, with quantized XYZ coordinates in meters in the ranges x∈[−1.0, 1.0], y∈[−1.0, 1.0], z∈[0.0, 1.6].
- 8-dimensional action vectors (float32).
- Up to 3 natural language annotations per episode describing the task.
The dataset covers 23,858 distinct tasks and ships a single train split (there is no evaluation split). The total dataset size is approximately 1.3 TB.
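To get a feel for the scale, the episode and frame counts above can be turned into recording time. The short sketch below uses only the figures quoted in this summary; the variable names are illustrative.

```python
# Figures quoted in the dataset summary.
TOTAL_EPISODES = 58_201
TOTAL_FRAMES = 18_083_626
FPS = 15

# Frames divided by frame rate gives recorded time (per camera stream).
total_seconds = TOTAL_FRAMES / FPS
total_hours = total_seconds / 3600

# Average episode length in frames and in seconds.
mean_frames_per_episode = TOTAL_FRAMES / TOTAL_EPISODES
mean_episode_seconds = mean_frames_per_episode / FPS

print(f"total duration: {total_hours:.1f} h")
print(f"mean episode: {mean_frames_per_episode:.1f} frames "
      f"({mean_episode_seconds:.1f} s)")
```

This works out to roughly 335 hours of robot time, with an average episode of about 311 frames, or a little under 21 seconds.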
Dataset Structure
{
"codebase_version": "v3.0",
"fps": 15,
"features": {
"observation.images.wrist": {
"dtype": "video",
"shape": [224, 398, 3],
"info": {
"video.height": 224,
"video.width": 398,
"video.codec": "av1",
"video.pix_fmt": "yuv420p",
"video.is_depth_map": false,
"video.fps": 15,
"video.channels": 3,
"has_audio": false
}
},
"observation.images.external_0": {
"dtype": "video",
"shape": [224, 398, 3],
"info": {
"video.height": 224,
"video.width": 398,
"video.codec": "av1",
"video.pix_fmt": "yuv420p",
"video.is_depth_map": false,
"video.fps": 15,
"video.channels": 3,
"has_audio": false
}
},
"observation.images.external_1": {
"dtype": "video",
"shape": [224, 398, 3],
"info": {
"video.height": 224,
"video.width": 398,
"video.codec": "av1",
"video.pix_fmt": "yuv420p",
"video.is_depth_map": false,
"video.fps": 15,
"video.channels": 3,
"has_audio": false
}
},
"observation.depth.wrist": {
"dtype": "depth_video",
"shape": [224, 398, 1],
"scale": "uint12_mm",
"encoding": "h265_uint12",
"depth_range_mm": 2000.0
},
"observation.depth.external_0": {
"dtype": "depth_video",
"shape": [224, 398, 1],
"scale": "uint12_mm",
"encoding": "h265_uint12",
"depth_range_mm": 2000.0
},
"observation.depth.external_1": {
"dtype": "depth_video",
"shape": [224, 398, 1],
"scale": "uint12_mm",
"encoding": "h265_uint12",
"depth_range_mm": 2000.0
},
"observation.pointcloud.wrist": {
"dtype": "pointcloud",
"shape": [null, 3],
"max_points": 2048,
"features": [],
"quantize_xyz": true,
"xyz_range_x": [-1.0, 1.0],
"xyz_range_y": [-1.0, 1.0],
"xyz_range_z": [0.0, 1.6]
},
"observation.pointcloud.external_0": {
"dtype": "pointcloud",
"shape": [null, 3],
"max_points": 2048,
"features": [],
"quantize_xyz": true,
"xyz_range_x": [-1.0, 1.0],
"xyz_range_y": [-1.0, 1.0],
"xyz_range_z": [0.0, 1.6]
},
"observation.pointcloud.external_1": {
"dtype": "pointcloud",
"shape": [null, 3],
"max_points": 2048,
"features": [],
"quantize_xyz": true,
"xyz_range_x": [-1.0, 1.0],
"xyz_range_y": [-1.0, 1.0],
"xyz_range_z": [0.0, 1.6]
},
"action": {
"dtype": "float32",
"shape": [8]
},
"language_1": {
"dtype": "string",
"shape": [1]
},
"language_2": {
"dtype": "string",
"shape": [1]
},
"language_3": {
"dtype": "string",
"shape": [1]
},
"timestamp": {
"dtype": "float32",
"shape": [1],
"names": null
},
"frame_index": {
"dtype": "int64",
"shape": [1],
"names": null
},
"episode_index": {
"dtype": "int64",
"shape": [1],
"names": null
},
"index": {
"dtype": "int64",
"shape": [1],
"names": null
},
"task_index": {
"dtype": "int64",
"shape": [1],
"names": null
}
},
"total_episodes": 58201,
"total_frames": 18083626,
"total_tasks": 23858,
"chunks_size": 1000,
"data_files_size_in_mb": 100,
"video_files_size_in_mb": 200,
"data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet",
"video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4",
"robot_type": null,
"splits": {
"train": "0:58201"
}
}
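The data_path and video_path entries are Python format templates, and the split entry is a "start:stop" string. As a minimal sketch, they can be expanded as follows; the helper names and the chunk and file indices are hypothetical example values, and the split is assumed to be a half-open episode range (which matches total_episodes).

```python
# Templates copied from info.json above.
DATA_PATH = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
VIDEO_PATH = "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"


def data_file(chunk_index: int, file_index: int) -> str:
    """Relative path of a Parquet data file (hypothetical helper)."""
    return DATA_PATH.format(chunk_index=chunk_index, file_index=file_index)


def video_file(video_key: str, chunk_index: int, file_index: int) -> str:
    """Relative path of a video file for one feature key (hypothetical helper)."""
    return VIDEO_PATH.format(
        video_key=video_key, chunk_index=chunk_index, file_index=file_index
    )


def parse_split(spec: str) -> range:
    """Turn "start:stop" into a range of episode indices.

    ASSUMPTION: the stop index is exclusive.
    """
    start, stop = (int(part) for part in spec.split(":"))
    return range(start, stop)


print(data_file(0, 12))
print(video_file("observation.depth.wrist", 0, 3))
print(len(parse_split("0:58201")))
```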
Data Format
The dataset follows the LeRobot format (v3.0) with the following directory layout:
droid_3d/
├── data/
│   └── chunk-000/
│       └── file-{000-999}.parquet       # Action and metadata (636 MB)
├── videos/
│   ├── observation.images.wrist/
│   │   └── chunk-000/                   # Wrist RGB videos
│   ├── observation.images.external_0/
│   │   └── chunk-000/                   # External camera 0 RGB videos
│   ├── observation.images.external_1/
│   │   └── chunk-000/                   # External camera 1 RGB videos
│   ├── observation.depth.wrist/
│   │   └── chunk-000/                   # Wrist depth videos (H.265)
│   ├── observation.depth.external_0/
│   │   └── chunk-000/                   # External camera 0 depth videos (H.265)
│   └── observation.depth.external_1/
│       └── chunk-000/                   # External camera 1 depth videos (H.265)
├── pointclouds/
│   ├── observation.pointcloud.wrist/
│   │   └── chunk-000/                   # Wrist point clouds (Parquet)
│   ├── observation.pointcloud.external_0/
│   │   └── chunk-000/                   # External camera 0 point clouds (Parquet)
│   └── observation.pointcloud.external_1/
│       └── chunk-000/                   # External camera 1 point clouds (Parquet)
├── meta/
│   ├── info.json                        # Dataset metadata
│   ├── stats.json                       # Dataset statistics
│   ├── tasks.parquet                    # Task definitions
│   └── episodes/                        # Episode metadata
└── README.md
Features
| Feature | Dtype | Shape | Details |
|---|---|---|---|
| observation.images.wrist | video | (224, 398, 3) | AV1, 15 FPS, yuv420p |
| observation.images.external_0 | video | (224, 398, 3) | AV1, 15 FPS, yuv420p |
| observation.images.external_1 | video | (224, 398, 3) | AV1, 15 FPS, yuv420p |
| observation.depth.wrist | depth_video | (224, 398, 1) | H.265 uint12, scale=uint12_mm, range=2000 mm |
| observation.depth.external_0 | depth_video | (224, 398, 1) | H.265 uint12, scale=uint12_mm, range=2000 mm |
| observation.depth.external_1 | depth_video | (224, 398, 1) | H.265 uint12, scale=uint12_mm, range=2000 mm |
| observation.pointcloud.wrist | pointcloud | (2048, 3) | Quantized XYZ, x∈[−1,1], y∈[−1,1], z∈[0,1.6] |
| observation.pointcloud.external_0 | pointcloud | (2048, 3) | Quantized XYZ, x∈[−1,1], y∈[−1,1], z∈[0,1.6] |
| observation.pointcloud.external_1 | pointcloud | (2048, 3) | Quantized XYZ, x∈[−1,1], y∈[−1,1], z∈[0,1.6] |
| action | float32 | (8,) | Robot action vector |
| language_1 | string | (1,) | First language annotation |
| language_2 | string | (1,) | Second language annotation |
| language_3 | string | (1,) | Third language annotation |
| timestamp | float32 | (1,) | Frame timestamp |
| frame_index | int64 | (1,) | Frame index within episode |
| episode_index | int64 | (1,) | Episode identifier |
| index | int64 | (1,) | Global frame index |
| task_index | int64 | (1,) | Task identifier |
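The point cloud rows list per-axis coordinate ranges for the quantized XYZ values. The card does not describe the storage scheme itself, so the following is only a generic sketch of uniform quantization over those ranges. The 16-bit depth and the function names are assumptions for illustration, not the format used by lerobotdataset3d, which handles decoding on load.

```python
# Coordinate ranges in meters, taken from the feature table above.
XYZ_RANGES = ((-1.0, 1.0), (-1.0, 1.0), (0.0, 1.6))

BITS = 16  # ASSUMPTION: illustrative bit depth, not stated in the dataset card
LEVELS = (1 << BITS) - 1


def quantize(point):
    """Map an (x, y, z) point in meters to integer codes, clipping to range."""
    codes = []
    for value, (lo, hi) in zip(point, XYZ_RANGES):
        clipped = min(max(value, lo), hi)
        codes.append(round((clipped - lo) / (hi - lo) * LEVELS))
    return tuple(codes)


def dequantize(codes):
    """Map integer codes back to coordinates in meters."""
    return tuple(
        lo + code / LEVELS * (hi - lo)
        for code, (lo, hi) in zip(codes, XYZ_RANGES)
    )


point = (0.25, -0.4, 0.9)
restored = dequantize(quantize(point))
max_error = max(abs(a - b) for a, b in zip(point, restored))
print(restored, max_error)
```

Under this scheme the round-trip error is bounded by half a quantization step per axis, and points outside the listed ranges are clipped to the boundary.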
Splits
| Split | Episodes | Frames |
|---|---|---|
| train | 58,201 | 18,083,626 |
Dataset Size
- Videos: 823 GB
- Point clouds: 461 GB
- Parquet data: 636 MB
- Total: ~1.3 TB
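As a quick check on the figures above, the components can be summed and divided by the episode count to estimate the storage needed per episode. This sketch assumes decimal units (1 GB = 1000 MB); the variable names are illustrative.

```python
# Component sizes from the list above, in GB (636 MB = 0.636 GB).
SIZES_GB = {"videos": 823.0, "point clouds": 461.0, "parquet data": 0.636}
TOTAL_EPISODES = 58_201

total_gb = sum(SIZES_GB.values())

# ASSUMPTION: decimal units, 1 GB = 1000 MB.
per_episode_mb = total_gb * 1000 / TOTAL_EPISODES

for name, size in SIZES_GB.items():
    print(f"{name}: {size / total_gb:.1%} of total")
print(f"total: {total_gb / 1000:.2f} TB, about {per_episode_mb:.0f} MB per episode")
```

Videos account for roughly two thirds of the total and point clouds for most of the rest, which is useful to know when downloading only part of the dataset.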
Usage
⚠️ Because standard LeRobot lacks depth video and point cloud decoding, you must install and use lerobotdataset3d to load this dataset.
Install
pip install git+https://github.com/ZibinDong/lerobotdataset3d.git
Load the dataset
from lerobotdataset3d import LeRobotDatasetDepthPointcloud
dataset = LeRobotDatasetDepthPointcloud(
repo_id="ZibinDong/droid_3d",
root="/local_path/to/droid_3d",
)
item = dataset[0]
# RGB video frames: (3, H, W) float32 in [0, 255]
item["observation.images.wrist"].shape # torch.Size([3, 224, 398])
item["observation.images.external_0"]
# Depth frames: (1, H, W) float32 in meters
item["observation.depth.wrist"].shape # torch.Size([1, 224, 398])
item["observation.depth.external_0"]
# Point clouds: (max_points, 3) float32 in meters
item["observation.pointcloud.wrist"].shape # torch.Size([2048, 3])
item["observation.pointcloud.external_0"]
# Actions, language, and metadata
item["action"] # torch.Size([8])
item["language_1"] # str
item["language_2"]
item["language_3"]
item["episode_index"]
item["frame_index"]
item["index"]
item["timestamp"]
item["task_index"]
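Since an episode carries up to three annotations, a training loop usually needs to choose one of them per sample. The helper below is a minimal sketch that assumes missing annotations appear as empty strings (or are absent); the function names and the stand-in sample are hypothetical and not part of lerobotdataset3d.

```python
import random

LANGUAGE_KEYS = ("language_1", "language_2", "language_3")


def available_instructions(item):
    """Return the non-empty annotations of a sample, in key order."""
    texts = []
    for key in LANGUAGE_KEYS:
        text = item.get(key)
        # ASSUMPTION: a missing annotation is an empty string or an absent key.
        if isinstance(text, str) and text.strip():
            texts.append(text.strip())
    return texts


def sample_instruction(item, rng, default=""):
    """Pick one available annotation at random, or fall back to `default`."""
    options = available_instructions(item)
    return rng.choice(options) if options else default


# Hypothetical stand-in for dataset[0]; real samples come from the loader.
fake_item = {
    "language_1": "put the cup on the plate",
    "language_2": "",
    "language_3": "place the cup on the plate",
}
print(sample_instruction(fake_item, random.Random(0)))
```

Sampling among the annotations rather than always taking language_1 is a simple form of language augmentation; pass a seeded random.Random for reproducibility.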
Advanced: temporal window sampling
from lerobotdataset3d import LeRobotDatasetDepthPointcloud
dataset = LeRobotDatasetDepthPointcloud(
repo_id="ZibinDong/droid_3d",
root="/local_path/to/droid_3d",
delta_timestamps={
"observation.images.wrist": [-0.1, 0.0, 0.1],
"action": [-0.1, 0.0, 0.1, 0.2, 0.3],
},
)
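Upstream LeRobot validates that every delta_timestamps offset is a multiple of the frame period 1/fps, within a small tolerance. Assuming lerobotdataset3d keeps that check, offsets for this 15 FPS dataset are easiest to build from whole frame counts: the period is 1/15 s (about 0.067 s), so an offset such as 0.1 s falls between two frames. The helper names and the tolerance value below are illustrative.

```python
FPS = 15  # frame rate from info.json


def frame_offsets_to_seconds(offsets, fps=FPS):
    """Convert whole-frame offsets (e.g. -1, 0, 1) to seconds."""
    return [k / fps for k in offsets]


def is_multiple_of_frame_period(ts, fps=FPS, tolerance_s=1e-4):
    """True if `ts` lands on a frame boundary within `tolerance_s` seconds.

    ASSUMPTION: mirrors the kind of check upstream LeRobot performs.
    """
    return abs(ts * fps - round(ts * fps)) / fps <= tolerance_s


delta_timestamps = {
    # previous, current, and next frame
    "observation.images.wrist": frame_offsets_to_seconds([-1, 0, 1]),
    # previous action, current action, and three future actions
    "action": frame_offsets_to_seconds(range(-1, 4)),
}

for key, offsets in delta_timestamps.items():
    print(key, [round(ts, 4) for ts in offsets])
print(is_multiple_of_frame_period(0.1))
```

The resulting dictionary can be passed as the delta_timestamps argument shown above.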
Visualize online
Citation
@article{dong2025embodiedmae,
title = {EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation},
author = {Dong, Zibin and Ni, Fei and Yuan, Yifu and Li, Yinchuan and Hao, Jianye},
journal = {arXiv preprint arXiv:2505.10105},
year = {2025}
}
