URL: https://huggingface.co/datasets/ldkong/EgoMM


EgoMM: Tri-Modal Egocentric Dataset (Video + Audio + IMU)

EgoMM is a large-scale tri-modal egocentric dataset combining Video, Audio, and IMU data from head-mounted Meta Aria glasses. It is built from two sources: EgoLife and Ego-Exo4D.

Dataset Summary

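| Split               | EgoLife | EgoExo4D | Total  | Narration files |
|---------------------|---------|----------|--------|-----------------|
| narrated (val/test) | 4,689   | 17,367   | 22,056 | 19,933          |
| raw (train)         | 27,119  | 6,357    | 33,476 | -               |
| Total               | 31,808  | 23,724   | 55,532 | 19,933          |
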
  • Clip duration: 30 seconds (fixed)
  • Total hours: 463h
  • Modalities: Video (MP4) + Audio (MP3) + IMU left/right (NPZ)
  • Video resolution: EgoLife 768×768, EgoExo4D 448×448
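
The total duration follows from the clip counts and the fixed 30-second clip length. A quick arithmetic check, using only the figures from the summary table (the variable names are illustrative):

```python
# Figures copied from the Dataset Summary table.
CLIP_SECONDS = 30
clip_counts = {
    "narrated": {"egolife": 4_689, "egoexo4d": 17_367},
    "raw": {"egolife": 27_119, "egoexo4d": 6_357},
}

# Sum over both splits and both sources, then convert seconds to hours.
total_clips = sum(sum(per_source.values()) for per_source in clip_counts.values())
total_hours = total_clips * CLIP_SECONDS / 3600

print(total_clips)            # 55532
print(round(total_hours, 1))  # 462.8, which the summary rounds to 463h
```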

Structure

EgoMM/
├── narrated/                      (val/test: clips with human annotations)
│   ├── egolife/{participant}/DAY1/{clip_id}/
│   │   ├── video.mp4              (768×768, 20fps, video-only)
│   │   ├── audio.mp3              (separate audio track)
│   │   ├── imu_left.npz           (800Hz, 6-axis: accel_xyz + gyro_xyz)
│   │   ├── imu_right.npz          (1000Hz, 6-axis)
│   │   └── narration.json         (clip-relative timestamps)
│   └── egoexo4d/{activity}/{take_name}/{clip_id}/
│       ├── video.mp4              (448×448, video-only)
│       ├── audio.mp3
│       ├── imu_left.npz
│       ├── imu_right.npz
│       └── narration.json
├── raw/                           (train: clips without annotations)
│   ├── egolife/{participant}/{DAY2-7}/{clip_id}/
│   │   ├── video.mp4
│   │   ├── audio.mp3
│   │   ├── imu_left.npz
│   │   └── imu_right.npz
│   └── egoexo4d/{activity}/{take_name}/{clip_id}/
│       └── ...
└── metadata/
    ├── egolife/                   (DenseCaption SRTs, Transcript, Caption/QA JSONs)
    └── egoexo4d/                  (atomic descriptions, expert commentary, splits)
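
After a partial download it can be useful to confirm that every clip directory holds the files listed in the tree. The sketch below is one way to do that. The function name `find_incomplete_clips` is hypothetical, and treating any directory that contains at least one expected file as a clip directory is an assumption of this sketch, not a rule stated by the dataset.

```python
from pathlib import Path

# Expected files per clip, as listed in the directory tree.
RAW_FILES = {"video.mp4", "audio.mp3", "imu_left.npz", "imu_right.npz"}
NARRATED_FILES = RAW_FILES | {"narration.json"}


def find_incomplete_clips(root: Path, split: str) -> dict[Path, set[str]]:
    """Map each clip directory under root/split to the files it is missing.

    Assumption: a clip directory is any directory holding at least one of
    the expected files. Complete clips are omitted from the result.
    """
    expected = NARRATED_FILES if split == "narrated" else RAW_FILES
    clip_dirs = {p.parent for p in (root / split).rglob("*") if p.name in expected}
    missing: dict[Path, set[str]] = {}
    for clip_dir in sorted(clip_dirs):
        present = {p.name for p in clip_dir.iterdir() if p.is_file()}
        gap = expected - present
        if gap:
            missing[clip_dir] = gap
    return missing
```

Calling `find_incomplete_clips(Path("./EgoMM"), "narrated")` returns an empty dict when every clip found is complete.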

Download

from huggingface_hub import snapshot_download

# Download narrated set only (val/test, ~120 GB)
snapshot_download(
    repo_id="ldkong/EgoMM",
    repo_type="dataset",
    local_dir="./EgoMM",
    allow_patterns=["narrated/**"]
)

# Download raw set only (train, ~370 GB)
snapshot_download(
    repo_id="ldkong/EgoMM",
    repo_type="dataset",
    local_dir="./EgoMM",
    allow_patterns=["raw/**"]
)

# Download metadata only
snapshot_download(
    repo_id="ldkong/EgoMM",
    repo_type="dataset",
    local_dir="./EgoMM",
    allow_patterns=["metadata/**"]
)

# Download a specific activity
snapshot_download(
    repo_id="ldkong/EgoMM",
    repo_type="dataset",
    local_dir="./EgoMM",
    allow_patterns=["narrated/egoexo4d/cooking/**"]
)
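
The same pattern extends to any slice of the directory layout. The helper below is a minimal sketch that builds the `allow_patterns` list from a split, a source, and an optional participant or activity. `build_patterns` and `download_subset` are hypothetical names, and the sketch assumes that participant and activity folder names match the labels used in the tables of this card (for example `A1_JAKE` or `cooking`).

```python
from typing import Optional


def build_patterns(split: str, source: str, group: Optional[str] = None) -> list[str]:
    """Build an allow_patterns list for one slice of the repository.

    split:  "narrated" or "raw"
    source: "egolife" or "egoexo4d"
    group:  a participant (EgoLife) or an activity (EgoExo4D); None selects all
    """
    if split not in {"narrated", "raw"}:
        raise ValueError(f"unknown split: {split!r}")
    if source not in {"egolife", "egoexo4d"}:
        raise ValueError(f"unknown source: {source!r}")
    prefix = f"{split}/{source}" if group is None else f"{split}/{source}/{group}"
    return [f"{prefix}/**"]


def download_subset(split: str, source: str, group: Optional[str] = None) -> str:
    """Download one slice into ./EgoMM and return the local directory path."""
    # Imported lazily so build_patterns can be used without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="ldkong/EgoMM",
        repo_type="dataset",
        local_dir="./EgoMM",
        allow_patterns=build_patterns(split, source, group),
    )
```

For example, `download_subset("narrated", "egolife", "A1_JAKE")` would request only that participant's narrated clips, provided the folder is named as assumed.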

Narration Format

Each narration.json contains clip-relative annotations:

{
  "clip_id": "upenn_0714_Cooking_7_2_0090000",
  "narrations": [
    {"timestamp": 2.4, "text": "C scrolls through the mobile phone screen", "type": "atomic"},
    {"timestamp": 5.1, "text": "C places the phone down", "type": "atomic"}
  ],
  "expert_commentary": [
    {"timestamp": 0.0, "text": "He needs to give himself more room...", "type": "expert"}
  ],
  "sequence_info": {
    "take_name": "upenn_0714_Cooking_7_2",
    "clip_index": 3,
    "total_clips_in_take": 12,
    "start_sec_in_take": 90.0,
    "end_sec_in_take": 120.0
  }
}
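
One way to consume this structure is to flatten atomic narrations and expert commentary into a single time-ordered list. The sketch below does that for the example above. The `Narration` dataclass and `parse_narration` are hypothetical names, and reading the two lists with `.get(...)` assumes either key may be absent in some files, which the card does not state.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Narration:
    take_time: float  # seconds from the start of the take
    clip_time: float  # seconds from the start of the clip
    text: str
    kind: str         # value of the "type" field, e.g. "atomic" or "expert"


def parse_narration(raw: str) -> list[Narration]:
    """Flatten narrations and expert commentary into one time-ordered list."""
    doc = json.loads(raw)
    offset = doc["sequence_info"]["start_sec_in_take"]
    # Assumption: either list may be missing, so default to an empty list.
    entries = doc.get("narrations", []) + doc.get("expert_commentary", [])
    events = [
        Narration(offset + e["timestamp"], e["timestamp"], e["text"], e["type"])
        for e in entries
    ]
    return sorted(events, key=lambda ev: ev.take_time)


# The example narration.json from this section, trimmed to the fields used.
example = """
{
  "clip_id": "upenn_0714_Cooking_7_2_0090000",
  "narrations": [
    {"timestamp": 2.4, "text": "C scrolls through the mobile phone screen", "type": "atomic"},
    {"timestamp": 5.1, "text": "C places the phone down", "type": "atomic"}
  ],
  "expert_commentary": [
    {"timestamp": 0.0, "text": "He needs to give himself more room...", "type": "expert"}
  ],
  "sequence_info": {"start_sec_in_take": 90.0, "end_sec_in_take": 120.0}
}
"""

for event in parse_narration(example):
    print(f"{event.take_time:6.1f}s [{event.kind}] {event.text}")
```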

Reconstructing Long Sequences

Clips can be combined into longer sequences using sequence_info:

import json
from pathlib import Path

# Dataset root: the local_dir used in the download step
root = Path("./EgoMM")

# Load all clips from a take
take_name = "upenn_0714_Cooking_7_2"
clips = sorted((root / "narrated/egoexo4d/cooking" / take_name).glob("*/narration.json"))

# Reconstruct full timeline: shift each clip-relative timestamp by the clip's start in the take
for clip_path in clips:
    clip = json.loads(clip_path.read_text())
    offset = clip["sequence_info"]["start_sec_in_take"]
    for n in clip["narrations"]:
        abs_time = offset + n["timestamp"]
        print(f"{abs_time:.1f}s: {n['text']}")
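
Before stitching clips together, it may help to check that the clips found on disk cover the take without holes, for instance when only part of a take was downloaded. The following sketch compares `start_sec_in_take` and `end_sec_in_take` of consecutive clips. `find_gaps` and its `tolerance` parameter are hypothetical, and the sketch assumes clips within a take do not overlap.

```python
def find_gaps(sequence_infos: list[dict], tolerance: float = 1e-3) -> list[tuple[float, float]]:
    """Return (gap_start, gap_end) intervals between consecutive clips of one take.

    sequence_infos: the "sequence_info" dicts of the clips that were loaded.
    tolerance:      slack in seconds before a difference counts as a gap.
    """
    ordered = sorted(sequence_infos, key=lambda s: s["start_sec_in_take"])
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt["start_sec_in_take"] - prev["end_sec_in_take"] > tolerance:
            gaps.append((prev["end_sec_in_take"], nxt["start_sec_in_take"]))
    return gaps


# Synthetic example: the clip covering 60-90 s is absent.
infos = [
    {"start_sec_in_take": 0.0, "end_sec_in_take": 30.0},
    {"start_sec_in_take": 90.0, "end_sec_in_take": 120.0},
    {"start_sec_in_take": 30.0, "end_sec_in_take": 60.0},
]
print(find_gaps(infos))  # [(60.0, 90.0)]
```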

IMU Format

Each NPZ file contains 7 arrays at 800Hz (left) or 1000Hz (right):

  • timestamp: int64 nanoseconds
  • accel_x, accel_y, accel_z: float64, m/s²
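  • gyro_x, gyro_y, gyro_z: float64, rad/s

import numpy as np

data = np.load("imu_left.npz")

# Clip duration in seconds (timestamps are in nanoseconds); expect ≈ 30.0
duration_s = (data["timestamp"][-1] - data["timestamp"][0]) / 1e9

# Mean acceleration magnitude; expect ≈ 9.8 m/s² (gravity) when the wearer is mostly still
accel = np.stack([data["accel_x"], data["accel_y"], data["accel_z"]], axis=1)
gravity = np.linalg.norm(accel, axis=1).mean()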
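
Because the left and right IMUs run at different rates (800Hz and 1000Hz), using both in one model usually requires a common time grid. The sketch below stacks the six channels into an (N, 6) matrix and linearly interpolates onto a uniform grid. `imu_to_matrix`, `resample`, and the target rate are illustrative choices, not part of the dataset. The sketch assumes each file starts at the beginning of the clip, and it applies no anti-aliasing filter, which a careful pipeline may want before downsampling.

```python
import numpy as np

CHANNELS = ["accel_x", "accel_y", "accel_z", "gyro_x", "gyro_y", "gyro_z"]


def imu_to_matrix(data) -> tuple[np.ndarray, np.ndarray]:
    """Return (seconds_from_first_sample, samples) with samples shaped (N, 6)."""
    t = (data["timestamp"] - data["timestamp"][0]) / 1e9  # nanoseconds to seconds
    return t, np.stack([data[k] for k in CHANNELS], axis=1)


def resample(t: np.ndarray, x: np.ndarray, rate_hz: float, duration_s: float) -> np.ndarray:
    """Linearly interpolate each channel onto a uniform grid at rate_hz."""
    n_out = int(round(duration_s * rate_hz))
    grid = np.arange(n_out) / rate_hz
    return np.stack([np.interp(grid, t, x[:, c]) for c in range(x.shape[1])], axis=1)
```

With real files, `resample(*imu_to_matrix(np.load("imu_left.npz")), 200.0, 30.0)` and the same call on `imu_right.npz` would both yield arrays of shape (6000, 6) that can be concatenated along the channel axis.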

Activities (EgoExo4D)

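| Activity      | Narrated clips | Raw clips |
|---------------|----------------|-----------|
| cooking       | ~7,800         | ~3,600    |
| music         | ~1,850         | ~1,030    |
| health        | ~1,950         | ~560      |
| bike_repair   | ~1,470         | ~260      |
| dance         | ~1,200         | ~380      |
| rock_climbing | ~1,040         | ~295      |
| basketball    | ~935           | ~316      |
| soccer        | ~740           | ~244      |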
Participants (EgoLife)

| Participant | DAY1 (narrated) | DAY2-7 (raw)  |
|-------------|-----------------|---------------|
| A1_JAKE     | ~780 clips      | ~4,500 clips  |
| A2_ALICE    | ~780 clips      | ~4,200 clips  |
| A3_TASHA    | ~780 clips      | ~4,800 clips  |
| A4_LUCIA    | ~780 clips      | ~4,600 clips  |
| A5_KATRINA  | ~780 clips      | ~4,100 clips  |
| A6_SHURE    | ~780 clips      | ~4,900 clips  |

Hardware

All data recorded with Meta Aria glasses:

  • Video: 1408×1408 fisheye RGB (downscaled to 768×768 for EgoLife, 448×448 for EgoExo4D)
  • Audio: built-in microphone
  • IMU: dual 6-axis sensors (left 800Hz, right 1000Hz)
  • Same hardware across EgoLife and EgoExo4D, so both sources share the same sensor layout and IMU rates, which should make it easier for models to transfer between them

Sources

  • EgoLife — 6 participants × 7 days of continuous daily recording
  • Ego-Exo4D — 4,168 takes of skilled activities (8 categories)

License

Please refer to the original dataset licenses:

  • EgoLife: S-Lab License 1.0
  • Ego-Exo4D: Ego-Exo4D Dataset License