This dataset was created using LeRobot.

IMPORTANT: Standard LeRobot (huggingface/lerobot) does not currently support depth videos or point clouds. To load this dataset, you must use the modified codebase: ZibinDong/lerobotdataset3d, which adds full decoding support for depth videos (H.265/H.264/FFV1) and quantized point clouds.

Dataset Description

droid_3d is a large-scale robot manipulation dataset collected with the DROID data collection platform. It contains multi-view RGB videos, depth videos, point clouds, robot actions, and natural language task descriptions. The dataset is designed for training vision-language-action models and 3D-aware robot policies.

Homepage: https://github.com/ZibinDong/lerobotdataset3d
Paper: EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation
License: MIT

Dataset Summary

The dataset comprises 58,201 episodes with a total of 18,083,626 frames, recorded at 15 FPS using a multi-camera setup (wrist + 2 external cameras). Each frame captures:

RGB videos from 3 camera views (wrist, external_0, external_1) at 224×398 resolution, encoded with AV1.
Depth videos from the same 3 views at 224×398 resolution, encoded with H.265 (h265_uint12), with a depth range of 2000 mm.
Point clouds (max 2048 points) for each camera, with quantized XYZ coordinates in the ranges: x∈[−1.0, 1.0], y∈[−1.0, 1.0], z∈[0.0, 1.6].
8-dimensional action vectors (float32).
Up to 3 natural language annotations per episode describing the task.

The dataset covers 23,858 distinct tasks and provides a single training split (there is no evaluation split). The total size is approximately 1.3 TB.
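To put the headline numbers in perspective, the sketch below derives total recording time and mean episode length from the episode count, frame count, and frame rate quoted above. The function names are illustrative and not part of any library.

```python
# Figures copied from the dataset summary above.
TOTAL_EPISODES = 58_201
TOTAL_FRAMES = 18_083_626
FPS = 15


def recording_hours(total_frames: int, fps: int) -> float:
    """Total recorded time in hours (per camera stream)."""
    return total_frames / fps / 3600


def mean_episode_seconds(total_frames: int, total_episodes: int, fps: int) -> float:
    """Average episode duration in seconds."""
    return total_frames / total_episodes / fps


if __name__ == "__main__":
    hours = recording_hours(TOTAL_FRAMES, FPS)
    seconds = mean_episode_seconds(TOTAL_FRAMES, TOTAL_EPISODES, FPS)
    print(f"{hours:.1f} h of recordings, {seconds:.1f} s per episode on average")
```

This works out to roughly 335 hours of recordings, with episodes averaging about 21 seconds (around 311 frames).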

Dataset Structure

meta/info.json:

{
  "codebase_version": "v3.0",
  "fps": 15,
  "features": {
    "observation.images.wrist": {
      "dtype": "video",
      "shape": [224, 398, 3],
      "info": {
        "video.height": 224,
        "video.width": 398,
        "video.codec": "av1",
        "video.pix_fmt": "yuv420p",
        "video.is_depth_map": false,
        "video.fps": 15,
        "video.channels": 3,
        "has_audio": false
      }
    },
    "observation.images.external_0": {
      "dtype": "video",
      "shape": [224, 398, 3],
      "info": {
        "video.height": 224,
        "video.width": 398,
        "video.codec": "av1",
        "video.pix_fmt": "yuv420p",
        "video.is_depth_map": false,
        "video.fps": 15,
        "video.channels": 3,
        "has_audio": false
      }
    },
    "observation.images.external_1": {
      "dtype": "video",
      "shape": [224, 398, 3],
      "info": {
        "video.height": 224,
        "video.width": 398,
        "video.codec": "av1",
        "video.pix_fmt": "yuv420p",
        "video.is_depth_map": false,
        "video.fps": 15,
        "video.channels": 3,
        "has_audio": false
      }
    },
    "observation.depth.wrist": {
      "dtype": "depth_video",
      "shape": [224, 398, 1],
      "scale": "uint12_mm",
      "encoding": "h265_uint12",
      "depth_range_mm": 2000.0
    },
    "observation.depth.external_0": {
      "dtype": "depth_video",
      "shape": [224, 398, 1],
      "scale": "uint12_mm",
      "encoding": "h265_uint12",
      "depth_range_mm": 2000.0
    },
    "observation.depth.external_1": {
      "dtype": "depth_video",
      "shape": [224, 398, 1],
      "scale": "uint12_mm",
      "encoding": "h265_uint12",
      "depth_range_mm": 2000.0
    },
    "observation.pointcloud.wrist": {
      "dtype": "pointcloud",
      "shape": [null, 3],
      "max_points": 2048,
      "features": [],
      "quantize_xyz": true,
      "xyz_range_x": [-1.0, 1.0],
      "xyz_range_y": [-1.0, 1.0],
      "xyz_range_z": [0.0, 1.6]
    },
    "observation.pointcloud.external_0": {
      "dtype": "pointcloud",
      "shape": [null, 3],
      "max_points": 2048,
      "features": [],
      "quantize_xyz": true,
      "xyz_range_x": [-1.0, 1.0],
      "xyz_range_y": [-1.0, 1.0],
      "xyz_range_z": [0.0, 1.6]
    },
    "observation.pointcloud.external_1": {
      "dtype": "pointcloud",
      "shape": [null, 3],
      "max_points": 2048,
      "features": [],
      "quantize_xyz": true,
      "xyz_range_x": [-1.0, 1.0],
      "xyz_range_y": [-1.0, 1.0],
      "xyz_range_z": [0.0, 1.6]
    },
    "action": {
      "dtype": "float32",
      "shape": [8]
    },
    "language_1": {
      "dtype": "string",
      "shape": [1]
    },
    "language_2": {
      "dtype": "string",
      "shape": [1]
    },
    "language_3": {
      "dtype": "string",
      "shape": [1]
    },
    "timestamp": {
      "dtype": "float32",
      "shape": [1],
      "names": null
    },
    "frame_index": {
      "dtype": "int64",
      "shape": [1],
      "names": null
    },
    "episode_index": {
      "dtype": "int64",
      "shape": [1],
      "names": null
    },
    "index": {
      "dtype": "int64",
      "shape": [1],
      "names": null
    },
    "task_index": {
      "dtype": "int64",
      "shape": [1],
      "names": null
    }
  },
  "total_episodes": 58201,
  "total_frames": 18083626,
  "total_tasks": 23858,
  "chunks_size": 1000,
  "data_files_size_in_mb": 100,
  "video_files_size_in_mb": 200,
  "data_path": "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet",
  "video_path": "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4",
  "robot_type": null,
  "splits": {
    "train": "0:58201"
  }
}
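The data_path and video_path entries are ordinary Python format strings, so concrete file names can be produced with str.format. The sketch below shows how the templates expand. The helper names are illustrative, and the indices are example values only; the templates alone do not say which chunk and file hold a given episode.

```python
# Templates copied verbatim from meta/info.json.
DATA_PATH = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
VIDEO_PATH = "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"


def data_file(chunk_index: int, file_index: int) -> str:
    """Relative path of a Parquet data file."""
    return DATA_PATH.format(chunk_index=chunk_index, file_index=file_index)


def video_file(video_key: str, chunk_index: int, file_index: int) -> str:
    """Relative path of a video file for one feature key."""
    return VIDEO_PATH.format(
        video_key=video_key, chunk_index=chunk_index, file_index=file_index
    )


if __name__ == "__main__":
    print(data_file(0, 12))
    print(video_file("observation.images.wrist", 0, 3))
```

The `:03d` specifier zero-pads each index to three digits, which is why the on-disk names look like chunk-000 and file-012.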

Data Format

The dataset follows the LeRobot format (v3.0) with the following directory layout:

droid_3d/
├── data/
│   └── chunk-000/
│       └── file-{000-999}.parquet         # Action and metadata (636 MB)
├── videos/
│   ├── observation.images.wrist/
│   │   └── chunk-000/                     # Wrist RGB videos
│   ├── observation.images.external_0/
│   │   └── chunk-000/                     # External camera 0 RGB videos
│   ├── observation.images.external_1/
│   │   └── chunk-000/                     # External camera 1 RGB videos
│   ├── observation.depth.wrist/
│   │   └── chunk-000/                     # Wrist depth videos (H.265)
│   ├── observation.depth.external_0/
│   │   └── chunk-000/                     # External camera 0 depth videos (H.265)
│   └── observation.depth.external_1/
│       └── chunk-000/                     # External camera 1 depth videos (H.265)
├── pointclouds/
│   ├── observation.pointcloud.wrist/
│   │   └── chunk-000/                     # Wrist point clouds (Parquet)
│   ├── observation.pointcloud.external_0/
│   │   └── chunk-000/                     # External camera 0 point clouds (Parquet)
│   └── observation.pointcloud.external_1/
│       └── chunk-000/                     # External camera 1 point clouds (Parquet)
├── meta/
│   ├── info.json                          # Dataset metadata
│   ├── stats.json                         # Dataset statistics
│   ├── tasks.parquet                      # Task definitions
│   └── episodes/                          # Episode metadata
└── README.md
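With a download of this size, it can be worth checking that a local copy has the expected top-level layout before trying to load it. The following is a minimal sketch using only the standard library; the directory names come from the tree above, while the helper names are illustrative and not part of lerobotdataset3d.

```python
from pathlib import Path

# Camera names as they appear in the feature keys.
CAMERAS = ("wrist", "external_0", "external_1")


def expected_dirs() -> list[str]:
    """Relative directories a complete local copy should contain."""
    dirs = ["data", "meta", "meta/episodes"]
    for cam in CAMERAS:
        dirs.append(f"videos/observation.images.{cam}")
        dirs.append(f"videos/observation.depth.{cam}")
        dirs.append(f"pointclouds/observation.pointcloud.{cam}")
    return dirs


def missing_dirs(root) -> list[str]:
    """Return the expected directories that are absent under root."""
    root = Path(root)
    return [d for d in expected_dirs() if not (root / d).is_dir()]


if __name__ == "__main__":
    # Placeholder path; point this at the actual download location.
    for d in missing_dirs("/local_path/to/droid_3d"):
        print("missing:", d)
```

This only checks that directories exist; it does not verify that every chunk or file inside them was downloaded.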

Features

Feature	Dtype	Shape	Details
`observation.images.wrist`	video	(224, 398, 3)	AV1, 15 FPS, yuv420p
`observation.images.external_0`	video	(224, 398, 3)	AV1, 15 FPS, yuv420p
`observation.images.external_1`	video	(224, 398, 3)	AV1, 15 FPS, yuv420p
`observation.depth.wrist`	depth_video	(224, 398, 1)	H.265 uint12, scale=uint12_mm, range=2000 mm
`observation.depth.external_0`	depth_video	(224, 398, 1)	H.265 uint12, scale=uint12_mm, range=2000 mm
`observation.depth.external_1`	depth_video	(224, 398, 1)	H.265 uint12, scale=uint12_mm, range=2000 mm
`observation.pointcloud.wrist`	pointcloud	(2048, 3)	Quantized XYZ, x∈[−1,1], y∈[−1,1], z∈[0,1.6]
`observation.pointcloud.external_0`	pointcloud	(2048, 3)	Quantized XYZ, x∈[−1,1], y∈[−1,1], z∈[0,1.6]
`observation.pointcloud.external_1`	pointcloud	(2048, 3)	Quantized XYZ, x∈[−1,1], y∈[−1,1], z∈[0,1.6]
`action`	float32	(8,)	Robot action vector
`language_1`	string	(1,)	First language annotation
`language_2`	string	(1,)	Second language annotation
`language_3`	string	(1,)	Third language annotation
`timestamp`	float32	(1,)	Frame timestamp
`frame_index`	int64	(1,)	Frame index within episode
`episode_index`	int64	(1,)	Episode identifier
`index`	int64	(1,)	Global frame index
`task_index`	int64	(1,)	Task identifier
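The point cloud features store quantized XYZ coordinates within fixed per-axis ranges. The card does not state the bit depth or the exact mapping, so the sketch below is only an illustration of uniform quantization over those ranges: the 16-bit depth, the linear mapping, and the function names are all assumptions rather than the format used by lerobotdataset3d.

```python
# Per-axis ranges from meta/info.json (xyz_range_x, xyz_range_y, xyz_range_z).
XYZ_RANGES = ((-1.0, 1.0), (-1.0, 1.0), (0.0, 1.6))
BITS = 16  # ASSUMPTION: the actual bit depth is not given in the card.


def quantize(value: float, lo: float, hi: float, bits: int = BITS) -> int:
    """Clip value to [lo, hi] and map it linearly to an integer code."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)


def dequantize(code: int, lo: float, hi: float, bits: int = BITS) -> float:
    """Map an integer code back to a coordinate in [lo, hi]."""
    levels = (1 << bits) - 1
    return lo + code / levels * (hi - lo)


def roundtrip_point(point):
    """Quantize and dequantize one (x, y, z) point."""
    return tuple(
        dequantize(quantize(v, lo, hi), lo, hi)
        for v, (lo, hi) in zip(point, XYZ_RANGES)
    )


if __name__ == "__main__":
    print(roundtrip_point((0.25, -0.5, 1.2)))  # close to the input
    print(roundtrip_point((2.0, 0.0, -1.0)))   # out-of-range values are clipped
```

Under this scheme, points outside the stated ranges cannot be represented and collapse onto the boundary. The same helpers with bits=12 and a range of (0.0, 2000.0) would model a linear 12-bit depth code with a step of roughly 0.49 mm, though whether h265_uint12 uses that mapping is not stated here.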

Splits

Split	Episodes	Frames
train	58,201	18,083,626

Dataset Size

Videos: 823 GB
Point clouds: 461 GB
Parquet data: 636 MB
Total: ~1.3 TB
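For storage planning, the component sizes can be summed and divided by the episode and frame counts. This sketch assumes the figures above are decimal units (1 GB = 1000 MB); the variable and function names are illustrative.

```python
# Component sizes from the list above, in GB (assumed decimal units).
SIZES_GB = {"videos": 823.0, "pointclouds": 461.0, "parquet": 0.636}
TOTAL_EPISODES = 58_201
TOTAL_FRAMES = 18_083_626


def total_gb(sizes: dict) -> float:
    """Sum of all component sizes in GB."""
    return sum(sizes.values())


def mb_per_episode(sizes: dict, episodes: int) -> float:
    """Average storage per episode in MB."""
    return total_gb(sizes) * 1000 / episodes


def kb_per_frame(sizes: dict, frames: int) -> float:
    """Average storage per frame (all modalities combined) in KB."""
    return total_gb(sizes) * 1_000_000 / frames


if __name__ == "__main__":
    print(f"total: {total_gb(SIZES_GB):.0f} GB")
    print(f"per episode: {mb_per_episode(SIZES_GB, TOTAL_EPISODES):.0f} MB")
    print(f"per frame: {kb_per_frame(SIZES_GB, TOTAL_FRAMES):.0f} KB")
```

The components add up to about 1,285 GB, consistent with the ~1.3 TB total, or roughly 22 MB per episode and 71 KB per frame across all cameras and modalities.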

Usage

⚠️ Because standard LeRobot lacks depth video and point cloud decoding, you must install and use lerobotdataset3d to load this dataset.

Install

pip install git+https://github.com/ZibinDong/lerobotdataset3d.git

Load the dataset

from lerobotdataset3d import LeRobotDatasetDepthPointcloud

dataset = LeRobotDatasetDepthPointcloud(
    repo_id="ZibinDong/droid_3d",
    root="/local_path/to/droid_3d",
)

item = dataset[0]

# RGB video frames: (3, H, W) float32 in [0, 255]
item["observation.images.wrist"].shape # torch.Size([3, 224, 398])
item["observation.images.external_0"]

# Depth frames: (1, H, W) float32 in meters
item["observation.depth.wrist"].shape # torch.Size([1, 224, 398])
item["observation.depth.external_0"]

# Point clouds: (max_points, 3) float32 in meters
item["observation.pointcloud.wrist"].shape # torch.Size([2048, 3])
item["observation.pointcloud.external_0"]

# Actions, language, and metadata
item["action"] # torch.Size([8])
item["language_1"] # str
item["language_2"]
item["language_3"]
item["episode_index"]
item["frame_index"]
item["index"]
item["timestamp"]
item["task_index"]
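Because an episode carries up to three annotations, some of the language_* fields may be unused for a given item. The helper below gathers whichever annotations are present. It is a minimal sketch that assumes unused slots come back as empty strings or are absent (the card does not say how they are represented), and collect_annotations is a hypothetical name, not part of lerobotdataset3d.

```python
LANGUAGE_KEYS = ("language_1", "language_2", "language_3")


def collect_annotations(item: dict) -> list[str]:
    """Return the non-empty language annotations of one dataset item."""
    texts = []
    for key in LANGUAGE_KEYS:
        value = item.get(key)
        # ASSUMPTION: unused slots are empty, whitespace-only, or missing.
        if isinstance(value, str) and value.strip():
            texts.append(value.strip())
    return texts


if __name__ == "__main__":
    # Stand-in for a real item; only the language fields matter here.
    example = {"language_1": "pick up the cup", "language_2": "", "language_3": "  "}
    print(collect_annotations(example))
```

A training loop could then sample one of the returned strings at random as the instruction for that frame.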

Advanced: temporal window sampling

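from lerobotdataset3d import LeRobotDatasetDepthPointcloud

# Offsets are in seconds relative to the sampled frame. At 15 FPS the frame
# period is 1/15 s, so offsets are written as multiples of it to land on
# recorded frames.
dataset = LeRobotDatasetDepthPointcloud(
    repo_id="ZibinDong/droid_3d",
    root="/local_path/to/droid_3d",
    delta_timestamps={
        "observation.images.wrist": [-1 / 15, 0.0, 1 / 15],
        "action": [-1 / 15, 0.0, 1 / 15, 2 / 15, 3 / 15],
    },
)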
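Since the dataset is recorded at 15 FPS, temporal windows are often easier to reason about in frames than in seconds. The sketch below converts frame offsets to delta_timestamps values and checks whether a given offset falls on the frame grid. Upstream LeRobot is understood to validate offsets against the frame rate within a small tolerance; the helper names and the tolerance value here are illustrative assumptions, not the library's API.

```python
FPS = 15  # from meta/info.json


def offsets_to_deltas(frame_offsets, fps: int = FPS) -> list[float]:
    """Convert integer frame offsets to time offsets in seconds."""
    return [k / fps for k in frame_offsets]


def is_frame_aligned(delta_s: float, fps: int = FPS, tolerance_s: float = 1e-4) -> bool:
    """True if delta_s is within tolerance_s of a whole number of frames."""
    frames = delta_s * fps
    return abs(frames - round(frames)) / fps <= tolerance_s


if __name__ == "__main__":
    # One past frame, the current frame, and three future frames.
    print(offsets_to_deltas([-1, 0, 1, 2, 3]))
    print(is_frame_aligned(1 / 15))  # one frame at 15 FPS
    print(is_frame_aligned(0.1))     # 1.5 frames at 15 FPS
```

An offset of 0.1 s corresponds to 1.5 frames at this frame rate, so it does not coincide with any recorded frame.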

Citation

@article{dong2025embodiedmae,
 title = {EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation},
 author = {Dong, Zibin and Ni, Fei and Yuan, Yifu and Li, Yinchuan and Hao, Jianye},
 journal = {arXiv preprint arXiv:2505.10105},
 year = {2025}
}

Paper for ZibinDong/droid_3d

Paper • 2505.10105 • Published May 15, 2025

URL: https://huggingface.co/datasets/ZibinDong/droid_3d
