EgoMM: Tri-Modal Egocentric Dataset (Video + Audio + IMU)
EgoMM is a large-scale tri-modal egocentric dataset combining Video, Audio, and IMU data from head-mounted Meta Aria glasses. It is built from the EgoLife and Ego-Exo4D datasets.
Dataset Summary
| Split | EgoLife | EgoExo4D | Total | Narration files |
|---|---|---|---|---|
| narrated (val/test) | 4,689 | 17,367 | 22,056 | 19,933 |
| raw (train) | 27,119 | 6,357 | 33,476 | — |
| Total | 31,808 | 23,724 | 55,532 | 19,933 |
- Clip duration: 30 seconds (fixed)
- Total hours: 463h
- Modalities: Video (MP4) + Audio (MP3) + IMU left/right (NPZ)
- Video resolution: EgoLife 768×768, EgoExo4D 448×448
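The total-hours figure follows from the clip counts and the fixed 30-second clip duration. A quick check, using only the numbers from the summary table:

```python
# Clip counts taken from the summary table.
clips = {
    "narrated": {"egolife": 4_689, "egoexo4d": 17_367},
    "raw": {"egolife": 27_119, "egoexo4d": 6_357},
}
CLIP_SECONDS = 30  # fixed clip duration

total_clips = sum(sum(split.values()) for split in clips.values())
total_hours = total_clips * CLIP_SECONDS / 3600

print(total_clips)         # 55532
print(round(total_hours))  # 463
```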
Structure
EgoMM/
├── narrated/                     (val/test: clips with human annotations)
│   ├── egolife/{participant}/DAY1/{clip_id}/
│   │   ├── video.mp4             (768×768, 20fps, video-only)
│   │   ├── audio.mp3             (separate audio track)
│   │   ├── imu_left.npz          (800Hz, 6-axis: accel_xyz + gyro_xyz)
│   │   ├── imu_right.npz         (1000Hz, 6-axis)
│   │   └── narration.json        (clip-relative timestamps)
│   └── egoexo4d/{activity}/{take_name}/{clip_id}/
│       ├── video.mp4             (448×448, video-only)
│       ├── audio.mp3
│       ├── imu_left.npz
│       ├── imu_right.npz
│       └── narration.json
├── raw/                          (train: clips without annotations)
│   ├── egolife/{participant}/{DAY2-7}/{clip_id}/
│   │   ├── video.mp4
│   │   ├── audio.mp3
│   │   ├── imu_left.npz
│   │   └── imu_right.npz
│   └── egoexo4d/{activity}/{take_name}/{clip_id}/
│       └── ...
└── metadata/
    ├── egolife/                  (DenseCaption SRTs, Transcript, Caption/QA JSONs)
    └── egoexo4d/                 (atomic descriptions, expert commentary, splits)
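After a download, it can be useful to confirm that each clip directory holds the files listed in the layout. The sketch below is one way to do that; `missing_files` and `find_incomplete_clips` are hypothetical helper names, and treating any directory that contains `video.mp4` as a clip directory is an assumption. Note that the summary table lists fewer narration files than narrated clips, so a missing `narration.json` is not necessarily an error.

```python
from pathlib import Path

# Files expected in every clip directory, per the layout in this card.
BASE_FILES = {"video.mp4", "audio.mp3", "imu_left.npz", "imu_right.npz"}


def missing_files(clip_dir, narrated):
    """Return the expected files absent from clip_dir, sorted by name."""
    expected = BASE_FILES | ({"narration.json"} if narrated else set())
    present = {p.name for p in Path(clip_dir).iterdir() if p.is_file()}
    return sorted(expected - present)


def find_incomplete_clips(root):
    """Map each incomplete clip directory under root to its missing files."""
    root = Path(root)
    report = {}
    for split, narrated in (("narrated", True), ("raw", False)):
        split_dir = root / split
        if not split_dir.is_dir():
            continue  # this split was not downloaded
        # Assumption: a clip directory is any directory containing video.mp4.
        for video in split_dir.rglob("video.mp4"):
            missing = missing_files(video.parent, narrated)
            if missing:
                report[str(video.parent.relative_to(root))] = missing
    return report
```

Calling `find_incomplete_clips("./EgoMM")` returns an empty dict when every clip found is complete.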
Download
from huggingface_hub import snapshot_download

# Download narrated set only (val/test, ~120 GB)
snapshot_download(
    repo_id="ldkong/EgoMM",
    repo_type="dataset",
    local_dir="./EgoMM",
    allow_patterns=["narrated/**"]
)

# Download raw set only (train, ~370 GB)
snapshot_download(
    repo_id="ldkong/EgoMM",
    repo_type="dataset",
    local_dir="./EgoMM",
    allow_patterns=["raw/**"]
)

# Download metadata only
snapshot_download(
    repo_id="ldkong/EgoMM",
    repo_type="dataset",
    local_dir="./EgoMM",
    allow_patterns=["metadata/**"]
)

# Download a specific activity
snapshot_download(
    repo_id="ldkong/EgoMM",
    repo_type="dataset",
    local_dir="./EgoMM",
    allow_patterns=["narrated/egoexo4d/cooking/**"]
)
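Since `allow_patterns` accepts a list, several activities or participants can be fetched in one call. The sketch below builds such a list; `build_patterns` is a hypothetical helper, and it assumes the folder names match the activity and participant names used elsewhere in this card.

```python
def build_patterns(split, egoexo4d_activities=(), egolife_participants=()):
    """Build allow_patterns globs for a subset of one split ("narrated" or "raw")."""
    if split not in ("narrated", "raw"):
        raise ValueError(f"unknown split: {split!r}")
    patterns = [f"{split}/egoexo4d/{a}/**" for a in egoexo4d_activities]
    patterns += [f"{split}/egolife/{p}/**" for p in egolife_participants]
    return patterns


patterns = build_patterns(
    "narrated",
    egoexo4d_activities=["cooking", "music"],
    egolife_participants=["A1_JAKE"],
)
print(patterns)
# ['narrated/egoexo4d/cooking/**', 'narrated/egoexo4d/music/**', 'narrated/egolife/A1_JAKE/**']
```

The resulting list can then be passed as `allow_patterns=patterns` to `snapshot_download`, with the same `repo_id`, `repo_type`, and `local_dir` as in the examples above.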
Narration Format
Each narration.json contains clip-relative annotations:
{
  "clip_id": "upenn_0714_Cooking_7_2_0090000",
  "narrations": [
    {"timestamp": 2.4, "text": "C scrolls through the mobile phone screen", "type": "atomic"},
    {"timestamp": 5.1, "text": "C places the phone down", "type": "atomic"}
  ],
  "expert_commentary": [
    {"timestamp": 0.0, "text": "He needs to give himself more room...", "type": "expert"}
  ],
  "sequence_info": {
    "take_name": "upenn_0714_Cooking_7_2",
    "clip_index": 3,
    "total_clips_in_take": 12,
    "start_sec_in_take": 90.0,
    "end_sec_in_take": 120.0
  }
}
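Atomic narrations and expert commentary share the same entry shape (`timestamp`, `text`, `type`), so they can be merged into a single time-ordered list. A minimal sketch, using an abbreviated copy of the example; `merged_events` is a hypothetical helper, and whether EgoLife narration files carry the same keys is not confirmed here:

```python
import json

# Abbreviated copy of the narration.json example from this card.
example = json.loads("""
{
  "clip_id": "upenn_0714_Cooking_7_2_0090000",
  "narrations": [
    {"timestamp": 2.4, "text": "C scrolls through the mobile phone screen", "type": "atomic"},
    {"timestamp": 5.1, "text": "C places the phone down", "type": "atomic"}
  ],
  "expert_commentary": [
    {"timestamp": 0.0, "text": "He needs to give himself more room...", "type": "expert"}
  ]
}
""")


def merged_events(narration):
    """Return atomic narrations and expert commentary as one list sorted by timestamp."""
    # .get() with a default is a defensive assumption: it is not confirmed
    # that both keys are present in every file.
    events = narration.get("narrations", []) + narration.get("expert_commentary", [])
    return sorted(events, key=lambda e: e["timestamp"])


for event in merged_events(example):
    print(f"{event['timestamp']:5.1f}s [{event['type']}] {event['text']}")
```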
Reconstructing Long Sequences
Clips can be combined into longer sequences using sequence_info:
import json
from pathlib import Path

# Paths are relative to the dataset root (e.g. run from inside ./EgoMM)
# Load all clips from a take
take_name = "upenn_0714_Cooking_7_2"
clips = sorted(Path("narrated/egoexo4d/cooking").glob(f"{take_name}/*/narration.json"))

# Reconstruct full timeline
for clip_path in clips:
    clip = json.loads(clip_path.read_text())
    # Shift clip-relative timestamps by the clip's start time within the take
    offset = clip["sequence_info"]["start_sec_in_take"]
    for n in clip["narrations"]:
        abs_time = offset + n["timestamp"]
        print(f"{abs_time:.1f}s: {n['text']}")
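When stitching clips together, it may help to check that the narrated clips of a take are contiguous, since the summary table lists fewer narration files than narrated clips. The sketch below compares each clip's `end_sec_in_take` with the next clip's `start_sec_in_take`; `find_gaps` and the sample entries are hypothetical.

```python
def find_gaps(sequence_infos, tolerance=1e-3):
    """Return (end, next_start) pairs where consecutive clips do not abut."""
    ordered = sorted(sequence_infos, key=lambda s: s["start_sec_in_take"])
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if abs(nxt["start_sec_in_take"] - prev["end_sec_in_take"]) > tolerance:
            gaps.append((prev["end_sec_in_take"], nxt["start_sec_in_take"]))
    return gaps


# Hypothetical sequence_info entries: the 60-90 s clip is missing.
infos = [
    {"start_sec_in_take": 0.0, "end_sec_in_take": 30.0},
    {"start_sec_in_take": 30.0, "end_sec_in_take": 60.0},
    {"start_sec_in_take": 90.0, "end_sec_in_take": 120.0},
]
print(find_gaps(infos))  # [(60.0, 90.0)]
```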
IMU Format
Each NPZ file contains 7 arrays at 800Hz (left) or 1000Hz (right):
- timestamp: int64, nanoseconds
- accel_x, accel_y, accel_z: float64, m/s²
- gyro_x, gyro_y, gyro_z: float64, rad/s
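import numpy as np

data = np.load("imu_left.npz")
# Clip duration in seconds (timestamps are int64 nanoseconds); expected ≈ 30.0
duration_s = (data["timestamp"][-1] - data["timestamp"][0]) / 1e9
# Mean acceleration magnitude; expected to be close to gravity, ~9.8 m/s²
accel = np.stack([data["accel_x"], data["accel_y"], data["accel_z"]], axis=1)
gravity = np.linalg.norm(accel, axis=1).mean()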
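Because the left and right IMUs run at different rates, a model that consumes both may need them at a common sample rate. The sketch below resamples one channel by linear interpolation with `np.interp`; `resample_imu` is a hypothetical helper, it assumes strictly increasing timestamps, and it applies no anti-aliasing filter. It is demonstrated on synthetic data rather than a real clip. Each stream is timed from its own first sample here, so aligning left and right in time would additionally require a shared time origin.

```python
import numpy as np


def resample_imu(timestamps_ns, values, rate_hz):
    """Linearly interpolate one IMU channel onto a uniform grid at rate_hz.

    Returns (grid_seconds, resampled_values); grid time starts at 0.
    """
    t = (timestamps_ns - timestamps_ns[0]) / 1e9  # seconds since first sample
    grid = np.arange(0.0, t[-1], 1.0 / rate_hz)
    return grid, np.interp(grid, t, values)


# Synthetic stand-in for data["timestamp"] and data["accel_x"] at 1000 Hz.
ts = (np.arange(30_000) * 1_000_000).astype(np.int64)  # 1 ms steps, in ns
accel_x = np.sin(2 * np.pi * 1.0 * ts / 1e9)           # 1 Hz test signal

grid, resampled = resample_imu(ts, accel_x, rate_hz=800)
print(grid.shape, resampled.shape)
```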
Activities (EgoExo4D)
| Activity | Narrated clips | Raw clips |
|---|---|---|
| cooking | ~7,800 | ~3,600 |
| music | ~1,850 | ~1,030 |
| health | ~1,950 | ~560 |
| bike_repair | ~1,470 | ~260 |
| dance | ~1,200 | ~380 |
| rock_climbing | ~1,040 | ~295 |
| basketball | ~935 | ~316 |
| soccer | ~740 | ~244 |
Participants (EgoLife)
| Participant | DAY1 (narrated) | DAY2-7 (raw) |
|---|---|---|
| A1_JAKE | ~780 clips | ~4,500 clips |
| A2_ALICE | ~780 clips | ~4,200 clips |
| A3_TASHA | ~780 clips | ~4,800 clips |
| A4_LUCIA | ~780 clips | ~4,600 clips |
| A5_KATRINA | ~780 clips | ~4,100 clips |
| A6_SHURE | ~780 clips | ~4,900 clips |
Hardware
All data recorded with Meta Aria glasses:
- Video: 1408×1408 fisheye RGB (downscaled to 768×768 for EgoLife, 448×448 for EgoExo4D)
- Audio: built-in microphone
- IMU: dual 6-axis sensors (left 800Hz, right 1000Hz)
- Same hardware across EgoLife and EgoExo4D, so sensor characteristics match and models are expected to transfer between the two sources
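The nominal rates above imply the following sample counts for a 30-second clip. This is simple arithmetic on the figures stated in this card; actual per-clip counts may differ slightly, and no frame rate is given here for EgoExo4D video.

```python
CLIP_SECONDS = 30

# Nominal rates stated in this card.
rates_hz = {"imu_left": 800, "imu_right": 1000, "egolife_video": 20}

for name, rate in rates_hz.items():
    print(f"{name}: {rate * CLIP_SECONDS} samples per clip")
# imu_left: 24000 samples per clip
# imu_right: 30000 samples per clip
# egolife_video: 600 samples per clip
```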
Sources
- EgoLife — 6 participants × 7 days of continuous daily recording
- Ego-Exo4D — 4,168 takes of skilled activities (8 categories)
License
Please refer to the original dataset licenses:
- EgoLife: S-Lab License 1.0
- Ego-Exo4D: Ego-Exo4D Dataset License
