URL: https://huggingface.co/datasets/microsoft/VITRA-TeleData

VITRA Teleoperation Dataset

Dataset Summary

This dataset contains real-world robot teleoperation demonstrations collected using a 7-DoF robotic arm equipped with a dexterous hand and a head-mounted RGB camera. Each episode provides synchronized numerical state/action data and a video recording. The dataset is used for finetuning in the project VITRA: Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos.

Project page: https://microsoft.github.io/VITRA/


Hardware Setup


Data Modalities and Files

Each episode consists of two synchronized files:

  • <episode_id>.h5: numerical data including robot states, actions, kinematics, and metadata
  • <episode_id>.mp4: RGB video stream recorded from the head-mounted camera

The two files correspond one-to-one and share the same episode identifier.
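The pairing rule above can be sketched with the standard library alone. This is a minimal sketch, assuming the files sit somewhere under a single local directory and that episode identifiers are unique across subfolders; the function name `pair_episode_files` is hypothetical and not part of the dataset tooling.

```python
from pathlib import Path


def pair_episode_files(root):
    """Group .h5 and .mp4 files under `root` by their shared episode identifier.

    Returns (paired, unpaired): `paired` maps episode_id -> (h5_path, mp4_path),
    `unpaired` lists identifiers for which only one of the two files was found.
    """
    root = Path(root)
    # The file stem (name without extension) is taken as the episode identifier.
    h5_files = {p.stem: p for p in root.rglob("*.h5")}
    mp4_files = {p.stem: p for p in root.rglob("*.mp4")}

    both = sorted(h5_files.keys() & mp4_files.keys())
    paired = {eid: (h5_files[eid], mp4_files[eid]) for eid in both}
    unpaired = sorted(h5_files.keys() ^ mp4_files.keys())
    return paired, unpaired
```

Reporting the unpaired identifiers separately makes an incomplete download visible instead of silently dropping episodes.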


Coordinate Frames

The dataset uses the following coordinate frames:

  • arm_base
    Root frame of the arm kinematic chain, defined in the URDF.
  • ee_urdf
    End-effector frame defined in the URDF (joint7).
  • hand_mount
    Rigid mounting frame of the dexterous hand, including flange offset.
    This frame is rotationally aligned with the human hand axis illustrated in Figure 1 (identity rotation).
  • head_camera
    Optical center of the head-mounted RGB camera.

πŸ‘ Image

Figure 1. The hand_mount frame axes. Axis directions follow the human hand definition illustrated in the figure.


Arm Availability and Masks

The dataset format is compatible with both right-arm-only episodes and dual-arm episodes. The currently released dataset contains only right-arm data.

  • Missing arms/hands are filled with zeros to keep array shapes consistent.
  • Availability is indicated by:
    • /meta/has_left, /meta/has_right (episode-level)
    • /mask/* (frame-level)
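Because missing arms and hands are zero-filled, the zeros should not be read as real joint values. The sketch below shows one way to keep only the frames that a mask marks as valid. It operates on plain NumPy arrays with made-up values; the helper name `valid_frames` and the toy sizes (`T = 4`, `Na = 7`) are assumptions for illustration.

```python
import numpy as np


def valid_frames(values, mask, has_arm=True):
    """Return (indices, rows) for the frames where the arm or hand is present.

    values : (T, ...) array, e.g. the contents of /state/right_arm_joint
    mask   : (T,) bool array, e.g. the contents of /mask/right_arm
    has_arm: episode-level flag, e.g. /meta/has_right
    """
    values = np.asarray(values)
    mask = np.asarray(mask, dtype=bool)
    if values.shape[0] != mask.shape[0]:
        raise ValueError("values and mask must share the frame dimension T")
    if not has_arm:
        # The episode-level flag overrides the per-frame mask.
        mask = np.zeros_like(mask)
    idx = np.flatnonzero(mask)
    return idx, values[idx]


# Toy data: a right arm that is valid on frames 0, 1 and 3, and a zero-filled left arm.
T, Na = 4, 7
right = np.arange(T * Na, dtype=np.float64).reshape(T, Na)
left = np.zeros((T, Na))
right_mask = np.array([True, True, False, True])
left_mask = np.zeros(T, dtype=bool)

idx_r, right_valid = valid_frames(right, right_mask, has_arm=True)
idx_l, left_valid = valid_frames(left, left_mask, has_arm=False)
```

Returning the frame indices alongside the rows keeps the selection aligned with the video frames of the same episode.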

HDF5 File Structure

Each .h5 file follows the structure below:

/
├── meta/
│   ├── instruction                       string
│   ├── video_path                        string
│   ├── frame_count                       int            # T
│   ├── fps                               float
│   ├── has_left                          bool
│   └── has_right                         bool
│
├── kinematics/
│   ├── left_ee_urdf_to_hand_mount        (4, 4) float64
│   ├── right_ee_urdf_to_hand_mount       (4, 4) float64
│   ├── head_camera_to_left_arm_base      (4, 4) float64
│   └── head_camera_to_right_arm_base     (4, 4) float64
│
├── observation/
│   └── camera/
│       └── intrinsics                    (3, 3) float64
│
├── state/
│   ├── left_arm_joint                    (T, Na) float64  # joint positions (rad)
│   ├── right_arm_joint                   (T, Na) float64
│   ├── left_hand_mount_pose              (T, 6) float64   # hand_mount pose in arm_base: [x,y,z,rx,ry,rz]
│   ├── right_hand_mount_pose             (T, 6) float64   # hand_mount pose in arm_base: [x,y,z,rx,ry,rz]
│   ├── left_hand_mount_pose_in_cam       (T, 6) float64   # hand_mount pose in head_camera: [x,y,z,rx,ry,rz]
│   ├── right_hand_mount_pose_in_cam      (T, 6) float64   # hand_mount pose in head_camera: [x,y,z,rx,ry,rz]
│   ├── left_hand_joint                   (T, Nh) float64
│   └── right_hand_joint                  (T, Nh) float64
│
├── action/
│   ├── left_arm_joint                    (T, Na) float64  # target joint positions (rad)
│   ├── right_arm_joint                   (T, Na) float64  # target joint positions (rad)
│   ├── left_hand_joint                   (T, Nh) float64  # target joint positions (rad)
│   └── right_hand_joint                  (T, Nh) float64  # target joint positions (rad)
│
└── mask/
    ├── left_arm                          (T,) bool
    ├── right_arm                         (T,) bool
    ├── left_hand                         (T,) bool
    └── right_hand                        (T,) bool
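A quick consistency check against this layout can catch truncated or mismatched episodes before training. The sketch below is illustrative only: `check_episode` and `make_fake_episode` are hypothetical helpers, the stand-in episode is a plain dict of NumPy arrays, and `Nh = 6` is an arbitrary placeholder. It assumes the `meta` entries are stored as scalar datasets rather than HDF5 attributes, which the card does not specify.

```python
import numpy as np

# Per-frame datasets from the structure above; their first axis should equal T.
PER_FRAME_KEYS = (
    "state/left_arm_joint", "state/right_arm_joint",
    "state/left_hand_mount_pose", "state/right_hand_mount_pose",
    "state/left_hand_mount_pose_in_cam", "state/right_hand_mount_pose_in_cam",
    "state/left_hand_joint", "state/right_hand_joint",
    "action/left_arm_joint", "action/right_arm_joint",
    "action/left_hand_joint", "action/right_hand_joint",
    "mask/left_arm", "mask/right_arm", "mask/left_hand", "mask/right_hand",
)

# Episode-level matrices with a fixed shape.
FIXED_SHAPES = {
    "kinematics/left_ee_urdf_to_hand_mount": (4, 4),
    "kinematics/right_ee_urdf_to_hand_mount": (4, 4),
    "kinematics/head_camera_to_left_arm_base": (4, 4),
    "kinematics/head_camera_to_right_arm_base": (4, 4),
    "observation/camera/intrinsics": (3, 3),
}


def check_episode(episode):
    """Return a list of layout problems; an empty list means none were found.

    `episode` is anything indexable by dataset path whose values expose `.shape`,
    for example an open h5py.File or a dict of NumPy arrays.
    """
    problems = []
    T = int(np.asarray(episode["meta/frame_count"][()]))
    for key in PER_FRAME_KEYS:
        if key not in episode:
            problems.append(f"missing {key}")
        elif len(episode[key].shape) == 0 or episode[key].shape[0] != T:
            problems.append(f"{key}: first axis does not match frame_count={T}")
    for key, shape in FIXED_SHAPES.items():
        if key not in episode:
            problems.append(f"missing {key}")
        elif tuple(episode[key].shape) != shape:
            problems.append(f"{key}: expected shape {shape}")
    return problems


def make_fake_episode(T=3, Na=7, Nh=6):
    """Build an in-memory stand-in for one episode (all values are placeholders)."""
    ep = {"meta/frame_count": np.array(T)}
    for key in PER_FRAME_KEYS:
        if key.startswith("mask/"):
            ep[key] = np.zeros(T, dtype=bool)
        elif key.endswith("arm_joint"):
            ep[key] = np.zeros((T, Na))
        elif key.endswith("hand_joint"):
            ep[key] = np.zeros((T, Nh))
        else:
            ep[key] = np.zeros((T, 6))
    for key, shape in FIXED_SHAPES.items():
        ep[key] = np.eye(shape[0])
    return ep


fake = make_fake_episode()
problems_ok = check_episode(fake)

# Simulate a truncated array to show what a failure looks like.
fake["state/right_arm_joint"] = np.zeros((2, 7))
problems_bad = check_episode(fake)
```

With h5py installed, the same function could be applied to a real file by opening it with `h5py.File(path, "r")` and passing the file object, since h5py groups support path-style indexing and membership tests.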

Pose Representation

For all *_hand_mount_pose entries, poses are represented as:

[x, y, z, rx, ry, rz]

where:

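  • (x, y, z) is the position of the hand_mount frame in meters, expressed in arm_base (or in head_camera for the *_hand_mount_pose_in_cam entries)
  • (rx, ry, rz) is the rotation vector in axis–angle representation (radians): its direction is the rotation axis and its norm is the rotation angle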
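To use these poses in matrix form, the 6-vector can be expanded into a 4×4 homogeneous transform with Rodrigues' formula. The following is a minimal NumPy sketch; the function name `pose6_to_matrix` is hypothetical and the example pose is made up. If SciPy is available, `scipy.spatial.transform.Rotation.from_rotvec(...).as_matrix()` gives the same rotation block.

```python
import numpy as np


def pose6_to_matrix(pose):
    """Convert [x, y, z, rx, ry, rz] (meters, axis-angle in radians) to a 4x4 transform."""
    pose = np.asarray(pose, dtype=np.float64)
    t, r = pose[:3], pose[3:6]
    theta = np.linalg.norm(r)  # rotation angle

    # Skew-symmetric matrix of the (unnormalized) rotation vector.
    K = np.array([
        [0.0, -r[2], r[1]],
        [r[2], 0.0, -r[0]],
        [-r[1], r[0], 0.0],
    ])

    if theta < 1e-12:
        # First-order approximation for a vanishing rotation.
        R = np.eye(3) + K
    else:
        # Rodrigues' formula written for an unnormalized rotation vector.
        R = (np.eye(3)
             + (np.sin(theta) / theta) * K
             + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))

    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


# Made-up pose: 90 degree rotation about z, translated to (0.1, 0.2, 0.3).
T_example = pose6_to_matrix([0.1, 0.2, 0.3, 0.0, 0.0, np.pi / 2])
```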

Transformation Notation

A homogeneous transformation matrix is denoted by T (4×4).

  • Subscript: reference frame (the coordinate system used for expression)
  • Superscript: target frame (the frame being described)

All subscripts and superscripts are written on the right-hand side of T.

Example: T^{hand\_mount}_{arm\_base} represents the pose of hand_mount expressed in the arm_base frame.


Kinematic Relations and Episode-Specific Transforms

Different flange hardware or camera mounting configurations may be used across episodes or arms. As a result:

All kinematic and extrinsic transforms must be read from the current episode and must not be assumed constant.

The hand mounting pose expressed in arm_base is computed as:

T^{hand\_mount}_{arm\_base} = T^{ee\_urdf}_{arm\_base} \cdot T^{hand\_mount}_{ee\_urdf}

where:

  • T^{ee\_urdf}_{arm\_base} is obtained via forward kinematics (FK) from the arm joint positions, corresponding to the URDF end-effector frame (joint7).
  • T^{hand\_mount}_{ee\_urdf} is a fixed, episode-specific transform provided under /kinematics/*_ee_urdf_to_hand_mount.
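In matrix terms, the hand_mount pose in arm_base is the FK result right-multiplied by the episode's flange transform. The sketch below assumes the FK matrices have already been computed by some URDF-based kinematics library, which is not shown here; `hand_mount_in_arm_base` is a hypothetical helper and the numeric values are invented for illustration.

```python
import numpy as np


def hand_mount_in_arm_base(T_base_ee, T_ee_hand):
    """Compose the FK pose of ee_urdf with the episode's ee_urdf -> hand_mount transform.

    T_base_ee : (4, 4) or (T, 4, 4) pose of ee_urdf in arm_base (from FK)
    T_ee_hand : (4, 4) pose of hand_mount in ee_urdf, read from
                /kinematics/*_ee_urdf_to_hand_mount of the same episode
    """
    T_base_ee = np.asarray(T_base_ee, dtype=np.float64)
    T_ee_hand = np.asarray(T_ee_hand, dtype=np.float64)
    if T_base_ee.shape[-2:] != (4, 4) or T_ee_hand.shape != (4, 4):
        raise ValueError("expected 4x4 homogeneous transforms")
    # Matrix product broadcasts over a leading frame axis if one is present.
    return T_base_ee @ T_ee_hand


# Invented FK result: end effector rotated 90 degrees about z, located at (0.3, 0, 0.5).
T_base_ee = np.array([
    [0.0, -1.0, 0.0, 0.3],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
])
# Invented flange offset: 0.1 m along the end-effector x axis, no rotation.
T_ee_hand = np.eye(4)
T_ee_hand[0, 3] = 0.1

T_base_hand = hand_mount_in_arm_base(T_base_ee, T_ee_hand)
# The same call also handles a whole trajectory of FK results.
T_base_hand_traj = hand_mount_in_arm_base(np.stack([T_base_ee, T_base_ee]), T_ee_hand)
```

Because the end effector is rotated, the 0.1 m offset along its x axis appears along the y axis of arm_base, which is why the offset cannot simply be added to the position.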

Camera extrinsics may also vary across episodes.
Transforms under /kinematics/head_camera_to_*_arm_base should likewise be read from the current episode and must not be assumed constant. The hand mounting pose expressed in the head_camera frame (i.e. *_hand_mount_pose_in_cam) is:

T^{hand\_mount}_{head\_camera} = (T^{head\_camera}_{arm\_base})^{-1} \cdot T^{hand\_mount}_{arm\_base}

where:

  • T^{head\_camera}_{arm\_base} is the episode-specific transform provided under /kinematics/head_camera_to_*_arm_base
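Re-expressing the hand_mount pose in the camera frame amounts to left-multiplying by the inverse of the camera pose in arm_base. The sketch below follows the notation on this card and reads the stored matrix as the pose of head_camera in arm_base. That reading is an assumption, since the dataset key name could be interpreted in either direction, so it is worth comparing the result against the stored *_hand_mount_pose_in_cam values. Function names and numbers are hypothetical.

```python
import numpy as np


def invert_rigid(T):
    """Invert a 4x4 rigid transform using R^T instead of a general matrix inverse."""
    T = np.asarray(T, dtype=np.float64)
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv


def hand_mount_in_camera(T_base_cam, T_base_hand):
    """Express hand_mount in head_camera.

    T_base_cam  : (4, 4) pose of head_camera in arm_base (assumed meaning of
                  /kinematics/head_camera_to_*_arm_base)
    T_base_hand : (4, 4) or (T, 4, 4) pose of hand_mount in arm_base
    """
    return invert_rigid(T_base_cam) @ np.asarray(T_base_hand, dtype=np.float64)


# Invented extrinsics: camera 1 m above the arm base, axes aligned with arm_base.
T_base_cam = np.eye(4)
T_base_cam[:3, 3] = [0.0, 0.0, 1.0]
# Invented hand pose in arm_base.
T_base_hand = np.eye(4)
T_base_hand[:3, 3] = [0.2, 0.0, 0.4]

T_cam_hand = hand_mount_in_camera(T_base_cam, T_base_hand)
```

Using the transpose-based inverse keeps the result an exact rigid transform and avoids the numerical drift a general `np.linalg.inv` call can introduce.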
