EgoMM: Tri-Modal Egocentric Dataset (Video + Audio + IMU)

EgoMM is a large-scale tri-modal egocentric dataset that combines video, audio, and IMU data recorded with head-mounted Meta Aria glasses. It is built from the EgoLife and Ego-Exo4D datasets.

Dataset Summary

Split	EgoLife	EgoExo4D	Total	Narration files
narrated (val/test)	4,689	17,367	22,056	19,933
raw (train)	27,119	6,357	33,476	—
Total	31,808	23,724	55,532	19,933

Clip duration: 30 seconds (fixed)
Total hours: 463h
Modalities: Video (MP4) + Audio (MP3) + IMU left/right (NPZ)
Video resolution: EgoLife 768×768, EgoExo4D 448×448
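
The total duration follows from the clip counts and the fixed clip length. A quick arithmetic check, using only the figures from the summary table above:

```python
# Clip counts copied from the summary table; clips are a fixed 30 seconds.
CLIP_SECONDS = 30
clip_counts = {
    "narrated": {"egolife": 4_689, "egoexo4d": 17_367},
    "raw": {"egolife": 27_119, "egoexo4d": 6_357},
}

total_clips = sum(sum(per_source.values()) for per_source in clip_counts.values())
total_hours = total_clips * CLIP_SECONDS / 3600

print(f"{total_clips:,} clips, {total_hours:.1f} hours")
# prints "55,532 clips, 462.8 hours", which rounds to the stated 463h
```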

Structure

EgoMM/
├── narrated/ (val/test: clips with human annotations)
│   ├── egolife/{participant}/DAY1/{clip_id}/
│   │   ├── video.mp4 (768×768, 20fps, video-only)
│   │   ├── audio.mp3 (separate audio track)
│   │   ├── imu_left.npz (800Hz, 6-axis: accel_xyz + gyro_xyz)
│   │   ├── imu_right.npz (1000Hz, 6-axis)
│   │   └── narration.json (clip-relative timestamps)
│   └── egoexo4d/{activity}/{take_name}/{clip_id}/
│       ├── video.mp4 (448×448, video-only)
│       ├── audio.mp3
│       ├── imu_left.npz
│       ├── imu_right.npz
│       └── narration.json
├── raw/ (train: clips without annotations)
│   ├── egolife/{participant}/{DAY2-7}/{clip_id}/
│   │   ├── video.mp4
│   │   ├── audio.mp3
│   │   ├── imu_left.npz
│   │   └── imu_right.npz
│   └── egoexo4d/{activity}/{take_name}/{clip_id}/
│       └── ...
└── metadata/
    ├── egolife/ (DenseCaption SRTs, Transcript, Caption/QA JSONs)
    └── egoexo4d/ (atomic descriptions, expert commentary, splits)
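
After a download, it can be useful to confirm that each clip directory holds the files listed in the tree above. The following is a minimal sketch, not part of the dataset tooling: the helper name `find_incomplete_clips` is hypothetical, and it assumes a clip directory is any directory under the split that contains at least one of the expected files.

```python
from pathlib import Path

# File names taken from the directory tree; narration.json is only
# expected under narrated/.
REQUIRED = {"video.mp4", "audio.mp3", "imu_left.npz", "imu_right.npz"}


def find_incomplete_clips(root, split):
    """Return {clip_dir: [missing file names]} for one split.

    Hypothetical helper: a clip directory is assumed to be any directory
    that contains at least one of the expected files.
    """
    expected = REQUIRED | ({"narration.json"} if split == "narrated" else set())
    present = {}
    for path in (Path(root) / split).rglob("*"):
        if path.is_file() and path.name in expected:
            present.setdefault(path.parent, set()).add(path.name)
    return {
        clip_dir: sorted(expected - names)
        for clip_dir, names in present.items()
        if names != expected
    }


# Example call: find_incomplete_clips("./EgoMM", "narrated")
```

An empty result means every detected clip directory is complete. A directory with none of the expected files is not detected at all, so this check complements a comparison against the clip counts in the summary table rather than replacing it.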

Download

from huggingface_hub import snapshot_download

# Download narrated set only (val/test, ~120 GB)
snapshot_download(
    repo_id="ldkong/EgoMM",
    repo_type="dataset",
    local_dir="./EgoMM",
    allow_patterns=["narrated/**"],
)

# Download raw set only (train, ~370 GB)
snapshot_download(
    repo_id="ldkong/EgoMM",
    repo_type="dataset",
    local_dir="./EgoMM",
    allow_patterns=["raw/**"],
)

# Download metadata only
snapshot_download(
    repo_id="ldkong/EgoMM",
    repo_type="dataset",
    local_dir="./EgoMM",
    allow_patterns=["metadata/**"],
)

# Download a specific activity
snapshot_download(
    repo_id="ldkong/EgoMM",
    repo_type="dataset",
    local_dir="./EgoMM",
    allow_patterns=["narrated/egoexo4d/cooking/**"],
)
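
Since `allow_patterns` accepts a list, several subsets can be fetched in one call. The sketch below builds such a list from the directory layout and previews which files it would select. The helper `build_patterns` and the sample paths are hypothetical, and `fnmatch` is used here only as a rough local approximation of the Hub's pattern matching.

```python
from fnmatch import fnmatch


def build_patterns(participants=(), activities=(), splits=("narrated",), with_metadata=True):
    """Build an allow_patterns list for a subset of EgoMM (hypothetical helper)."""
    patterns = []
    for split in splits:
        patterns += [f"{split}/egolife/{p}/**" for p in participants]
        patterns += [f"{split}/egoexo4d/{a}/**" for a in activities]
    if with_metadata:
        patterns.append("metadata/**")
    return patterns


patterns = build_patterns(participants=["A1_JAKE"], activities=["cooking", "music"])

# Hypothetical repository paths, used only to preview the selection.
sample_paths = [
    "narrated/egolife/A1_JAKE/DAY1/clip_0001/video.mp4",
    "narrated/egolife/A2_ALICE/DAY1/clip_0001/video.mp4",
    "narrated/egoexo4d/cooking/upenn_0714_Cooking_7_2/upenn_0714_Cooking_7_2_0090000/narration.json",
    "raw/egoexo4d/cooking/some_take/some_clip/video.mp4",
    "metadata/egoexo4d/example.json",
]
selected = [p for p in sample_paths if any(fnmatch(p, pat) for pat in patterns)]

for path in selected:
    print(path)
```

The resulting `patterns` list can be passed as `allow_patterns` to `snapshot_download` in the same way as the single-pattern lists above.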

Narration Format

Each narration.json contains clip-relative annotations:

{
  "clip_id": "upenn_0714_Cooking_7_2_0090000",
  "narrations": [
    {"timestamp": 2.4, "text": "C scrolls through the mobile phone screen", "type": "atomic"},
    {"timestamp": 5.1, "text": "C places the phone down", "type": "atomic"}
  ],
  "expert_commentary": [
    {"timestamp": 0.0, "text": "He needs to give himself more room...", "type": "expert"}
  ],
  "sequence_info": {
    "take_name": "upenn_0714_Cooking_7_2",
    "clip_index": 3,
    "total_clips_in_take": 12,
    "start_sec_in_take": 90.0,
    "end_sec_in_take": 120.0
  }
}
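
Because narrations and expert commentary share the same entry shape (`timestamp`, `text`, `type`), they can be merged into a single clip-relative timeline. A minimal sketch using the example above; the helper `merged_events` is hypothetical, and it assumes either list may be absent in some files.

```python
# Example record mirroring the narration.json shown above.
clip = {
    "clip_id": "upenn_0714_Cooking_7_2_0090000",
    "narrations": [
        {"timestamp": 2.4, "text": "C scrolls through the mobile phone screen", "type": "atomic"},
        {"timestamp": 5.1, "text": "C places the phone down", "type": "atomic"},
    ],
    "expert_commentary": [
        {"timestamp": 0.0, "text": "He needs to give himself more room...", "type": "expert"},
    ],
}


def merged_events(clip):
    """Return narrations and expert commentary sorted by clip-relative time.

    Assumption: either key may be missing, so both default to an empty list.
    """
    events = clip.get("narrations", []) + clip.get("expert_commentary", [])
    return sorted(events, key=lambda event: event["timestamp"])


for event in merged_events(clip):
    print(f"{event['timestamp']:5.1f}s [{event['type']}] {event['text']}")
```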

Reconstructing Long Sequences

Clips can be combined into longer sequences using sequence_info:

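import json
from pathlib import Path

# Load all clips from a take (run from inside the EgoMM download directory)
take_name = "upenn_0714_Cooking_7_2"
paths = Path("narrated/egoexo4d/cooking").glob(f"{take_name}/*/narration.json")
clips = [json.loads(p.read_text()) for p in paths]

# Order clips by their position in the take rather than by file path
clips.sort(key=lambda clip: clip["sequence_info"]["start_sec_in_take"])

# Reconstruct the full timeline by shifting clip-relative timestamps
for clip in clips:
    offset = clip["sequence_info"]["start_sec_in_take"]
    for n in clip["narrations"]:
        abs_time = offset + n["timestamp"]
        print(f"{abs_time:.1f}s: {n['text']}")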
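
Before stitching clips together, it may help to check that no clip of the take is missing and to map a take-level time back to a clip. The sketch below relies on two assumptions inferred from the example narration.json (clip_index 3 starting at 90.0 s): `clip_index` is 0-based, and clip `i` starts at `i * 30` seconds. Both helper names are hypothetical.

```python
CLIP_SECONDS = 30.0  # fixed clip duration stated in the dataset summary


def missing_clip_indices(sequence_infos):
    """Return clip indices of a take that are absent from the given list.

    Assumption: clip_index runs from 0 to total_clips_in_take - 1.
    """
    if not sequence_infos:
        return []
    total = sequence_infos[0]["total_clips_in_take"]
    seen = {info["clip_index"] for info in sequence_infos}
    return [i for i in range(total) if i not in seen]


def locate(abs_time_sec):
    """Map a time within the take to (clip_index, clip-relative seconds)."""
    clip_index = int(abs_time_sec // CLIP_SECONDS)
    return clip_index, abs_time_sec - clip_index * CLIP_SECONDS


# Toy input: a 4-clip take where the clip with index 2 was not found.
infos = [{"clip_index": i, "total_clips_in_take": 4} for i in (0, 1, 3)]
print(missing_clip_indices(infos))
print(locate(92.4))
```

A reported gap does not necessarily mean data is missing from the dataset: the clip may not have been downloaded, or it may sit in the other split.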

IMU Format

Each NPZ file contains 7 arrays (one timestamp array and six sensor channels), sampled at 800Hz (left) or 1000Hz (right):

timestamp: int64 nanoseconds
accel_x, accel_y, accel_z: float64, m/s²
gyro_x, gyro_y, gyro_z: float64, rad/s

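import numpy as np

data = np.load("imu_left.npz")
# Clip duration in seconds; expected to be about 30.0
duration = (data["timestamp"][-1] - data["timestamp"][0]) / 1e9
# Mean acceleration magnitude; expected to be about 9.8 m/s² (gravity)
gravity = np.sqrt(data["accel_x"]**2 + data["accel_y"]**2 + data["accel_z"]**2).mean()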
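
For model input, the six channels are often more convenient as a single array. The sketch below stacks them into an (N, 6) matrix and estimates the sampling rate from the timestamps. It runs on synthetic data standing in for a loaded NPZ file; the helper names are hypothetical, and only the array names come from the format described above.

```python
import numpy as np

CHANNELS = ["accel_x", "accel_y", "accel_z", "gyro_x", "gyro_y", "gyro_z"]


def imu_matrix(data):
    """Stack the six channel arrays into one (N, 6) float array."""
    return np.stack([data[name] for name in CHANNELS], axis=1)


def sample_rate_hz(timestamp_ns):
    """Estimate the sampling rate from the median timestamp spacing."""
    return 1e9 / np.median(np.diff(timestamp_ns))


# Synthetic stand-in for np.load("imu_left.npz"): 30 s at 800 Hz, device at rest.
n = 30 * 800
data = {name: np.zeros(n) for name in CHANNELS}
data["accel_y"] = np.full(n, 9.81)
data["timestamp"] = (np.arange(n) * 1_250_000).astype(np.int64)  # 1.25 ms steps

imu = imu_matrix(data)
rate = sample_rate_hz(data["timestamp"])
gravity = np.linalg.norm(imu[:, :3], axis=1).mean()

print(imu.shape, rate, round(float(gravity), 2))
```

Using the median spacing keeps the rate estimate robust to occasional dropped or jittered samples, which a simple count divided by duration would not be.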

Activities (EgoExo4D)

Activity	Narrated clips	Raw clips
cooking	~7,800	~3,600
music	~1,850	~1,030
health	~1,950	~560
bike_repair	~1,470	~260
dance	~1,200	~380
rock_climbing	~1,040	~295
basketball	~935	~316
soccer	~740	~244

Participants (EgoLife)

Participant	DAY1 (narrated)	DAY2-7 (raw)
A1_JAKE	~780 clips	~4,500 clips
A2_ALICE	~780 clips	~4,200 clips
A3_TASHA	~780 clips	~4,800 clips
A4_LUCIA	~780 clips	~4,600 clips
A5_KATRINA	~780 clips	~4,100 clips
A6_SHURE	~780 clips	~4,900 clips

Hardware

All data recorded with Meta Aria glasses:

Video: 1408×1408 fisheye RGB (downscaled to 768×768 for EgoLife, 448×448 for EgoExo4D)
Audio: built-in microphone
IMU: dual 6-axis sensors (left 800Hz, right 1000Hz)
Both EgoLife and EgoExo4D use the same hardware, so sensor characteristics match across the two sources, which should make it easier to transfer models between them
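
Because the left and right IMUs run at different rates (800Hz and 1000Hz), joint processing usually needs both streams on one time grid. The following is a minimal sketch using linear interpolation. It assumes both timestamp arrays share a common clock, which the dataset card does not state, and the 200 Hz target rate and the helper name `align_imu` are arbitrary choices for illustration.

```python
import numpy as np


def align_imu(ts_left, left, ts_right, right, rate_hz=200.0):
    """Resample two (N, C) IMU streams onto a shared uniform time grid.

    Assumption: both int64 nanosecond timestamp arrays use the same clock.
    Returns (grid_seconds, left_resampled, right_resampled).
    """
    t0 = max(ts_left[0], ts_right[0])    # start of the overlapping window
    t1 = min(ts_left[-1], ts_right[-1])  # end of the overlapping window
    # Work in seconds relative to t0 so float64 keeps full precision.
    t_left = (ts_left - t0) / 1e9
    t_right = (ts_right - t0) / 1e9
    grid = np.arange(0.0, (t1 - t0) / 1e9, 1.0 / rate_hz)

    def interp(t, values):
        columns = [np.interp(grid, t, values[:, c]) for c in range(values.shape[1])]
        return np.stack(columns, axis=1)

    return grid, interp(t_left, left), interp(t_right, right)


# Synthetic 1-second streams whose channels all equal time in seconds.
ts_l = (np.arange(800) * 1_250_000).astype(np.int64)    # 800 Hz
ts_r = (np.arange(1000) * 1_000_000).astype(np.int64)   # 1000 Hz
left = np.repeat((ts_l / 1e9)[:, None], 6, axis=1)
right = np.repeat((ts_r / 1e9)[:, None], 6, axis=1)

grid, left_rs, right_rs = align_imu(ts_l, left, ts_r, right)
print(grid.shape, left_rs.shape, right_rs.shape)
```

Linear interpolation is a simple choice that does no anti-alias filtering; when downsampling signals with high-frequency content, low-pass filtering before resampling may be preferable.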

Sources

EgoLife — 6 participants × 7 days of continuous daily recording
Ego-Exo4D — 4,168 takes of skilled activities (8 categories)

License

Please refer to the original dataset licenses:

EgoLife: S-Lab License 1.0
Ego-Exo4D: Ego-Exo4D Dataset License

URL: https://huggingface.co/datasets/ldkong/EgoMM
