RoboTwin ICL v3 ARX-X5 VAM TI2V-5B OpenWAM EEF Mask-v3 Refdrop10 Train20 100K
This repository contains a 100k-step VAM checkpoint trained on the RoboTwin ICL-paired v3 dataset for the ARX-X5 target robot.
This is the EEF reference-video mask-v3 variant. The model was trained with cross-embodiment full reference videos and a reference dropout of 0.1, and it predicts absolute EEF/action-space targets. It is intended as the paired reference-video counterpart to the mask-v3 no-ref baseline, which shares the same model backbone, action stream, target robot, train20 task split, and 100k training horizon.
Files
step-100000.safetensors: joint VAM checkpoint at 100k steps
training_metadata.json: training and upload metadata
README.md: this model card
Checkpoint SHA256:
3c1060b425866afc60fc37136973fa36f5d6803c49468f6b7be5be2967e57fe8
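The digest above can be checked after download with a few lines of Python. This is a minimal sketch using only the standard library; the helper names `sha256_of_file` and `verify_checkpoint` are illustrative and not part of the VAM codebase.

```python
import hashlib
from pathlib import Path

# Digest published in this model card.
EXPECTED_SHA256 = "3c1060b425866afc60fc37136973fa36f5d6803c49468f6b7be5be2967e57fe8"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so a large checkpoint is never fully in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

Calling `verify_checkpoint("step-100000.safetensors")` in the download directory should return `True` for an intact file.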
The checkpoint contains both fine-tuned Wan video DiT weights and action_dit.* ActionMoT weights. The Wan2.2-TI2V-5B base assets are not bundled.
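Because both weight groups live in one file, it can be useful to separate them by name. The sketch below partitions tensor names on the `action_dit.` prefix mentioned above. The function name and the example key names are hypothetical, and the naming of the video DiT keys is not documented here, so everything without the prefix is simply grouped as "other".

```python
def split_checkpoint_keys(keys, action_prefix="action_dit."):
    """Partition tensor names into ActionMoT keys and all remaining keys."""
    action_keys = [k for k in keys if k.startswith(action_prefix)]
    other_keys = [k for k in keys if not k.startswith(action_prefix)]
    return action_keys, other_keys


# Hypothetical key names, for illustration only; real names may differ.
example_keys = [
    "blocks.0.self_attn.q.weight",
    "action_dit.blocks.0.self_attn.q.weight",
    "action_dit.head.bias",
]
action_keys, other_keys = split_checkpoint_keys(example_keys)
print(len(action_keys), "ActionMoT keys;", len(other_keys), "other keys")
```

With the `safetensors` package installed, the real tensor names can be listed without loading any weights via `safe_open(path, framework="pt")` and its `keys()` method, then passed to this function.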
Training Summary
run directory: src/vam/models/train/vam_icl/paired_v3_alltasks_mv_mot_ti2v5b_16g_100k_mask_v3_ref_openwam_eef_fastwam_warmstart_refdrop10_0524_0329
wandb run: run-20260524_113637-h7w9s4pf
wandb run name: paired_v3_train_mv_mot_ti2v5b_16g_100k_mask_v3_openwam_refdrop10_0524_0329
wandb url: https://wandb.ai/wuji-tech/vam_icl/runs/h7w9s4pf
checkpoint: step-100000.safetensors
dataset: RoboTwin ICL-paired v3
target robot: arx-x5
task split: 20 train tasks
episodes/task: 150 train episodes
backbone: Wan2.2-TI2V-5B
action stream: 30-layer ActionMoT, OpenWAM/FastWAM warm-start style
reference mode: enabled, cross-embodiment full reference video
reference dropout: 0.1
multiview: enabled, RoboTwin head + left wrist + right wrist layout
mask variant: v3
Action And Proprio Contract
action_space: eef
proprio_space: action
action_dim: 16
proprio_dim: 16
action_format: absolute
The model predicts absolute 16-D EEF/action-space targets. Use the matching local EEF/action normalization stats from the OpenWAM EEF training setup.
Temporal And Visual Contract
raw action window: 33 frames
action horizon: 32 actions
action_video_freq_ratio: 4
target video frames: 9 frames sampled at raw indices [0, 4, ..., 32]
resolution: 384 x 320
resize mode: stretch
multiview layout: head camera on top, left/right wrist cameras below
full reference video: enabled
max reference frames: 41
reference subsample factor: 4
paired reference: false
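The frame and action counts above are mutually consistent, which the following sketch makes explicit. The function names are illustrative, and treating the action horizon as one action per transition between consecutive raw frames is an assumption that happens to match the numbers listed here.

```python
RAW_ACTION_WINDOW = 33       # raw frames per window
ACTION_VIDEO_FREQ_RATIO = 4  # raw frames per sampled video frame


def target_video_indices(raw_window=RAW_ACTION_WINDOW, ratio=ACTION_VIDEO_FREQ_RATIO):
    """Raw-frame indices of the target video frames: every `ratio`-th frame from 0."""
    return list(range(0, raw_window, ratio))


def action_horizon(raw_window=RAW_ACTION_WINDOW):
    """Assumption: one action per transition between consecutive raw frames."""
    return raw_window - 1


indices = target_video_indices()
print(indices)                # [0, 4, 8, 12, 16, 20, 24, 28, 32]
print(len(indices))           # 9
print(action_horizon())       # 32
```

Equivalently, `(num_frames - 1) * action_video_freq_ratio = (9 - 1) * 4 = 32`, the action horizon.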
For closed-loop eval, use the matching local VAM code path with:
action_dim=16
proprio_dim=16
num_frames=9
action_horizon=32
height=384
width=320
multiview=true
action_space=eef
proprio_space=action
disable_reference_video=false
full_reference_video=true
max_ref_frames=41
ref_subsample_factor=4
mask_variant=v3
reference_dropout=0.1
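One way to carry these settings into an eval script is to parse the `key=value` lines into a typed dictionary and sanity-check them before launching. This is a sketch only: `parse_overrides` and `check_contract` are hypothetical helpers, and how the local VAM code actually consumes these options is not specified in this card.

```python
EVAL_OVERRIDES = """
action_dim=16
proprio_dim=16
num_frames=9
action_horizon=32
height=384
width=320
multiview=true
action_space=eef
proprio_space=action
disable_reference_video=false
full_reference_video=true
max_ref_frames=41
ref_subsample_factor=4
mask_variant=v3
reference_dropout=0.1
"""


def parse_value(text):
    """Convert 'true'/'false' to bool, numerals to int or float, else keep the string."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def parse_overrides(block):
    """Parse one key=value pair per non-empty line into a dict."""
    config = {}
    for line in block.strip().splitlines():
        key, _, value = line.partition("=")
        config[key.strip()] = parse_value(value.strip())
    return config


def check_contract(config, freq_ratio=4):
    """Raise ValueError if the settings contradict the contract listed in this card."""
    expected_horizon = (config["num_frames"] - 1) * freq_ratio
    if config["action_horizon"] != expected_horizon:
        raise ValueError(f"action_horizon should be {expected_horizon}")
    if config["full_reference_video"] and config["disable_reference_video"]:
        raise ValueError("full_reference_video requires reference video to be enabled")


config = parse_overrides(EVAL_OVERRIDES)
check_contract(config)
```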
Validation Snapshot
Last validation lines observed for this checkpoint:
[val_id step 100000] 20/98741 samples: loss=0.143614, loss_video=0.101261, loss_action=0.042352
[val_id step 100000] task=grab roller action_MSE=0.000742 action_MAE=0.020227 video_MSE=1894.89 PSNR=19.31 SSIM=0.8012 LPIPS=0.1998
These numbers are a training-time validation snapshot, not a full deployment evaluation.
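For comparing snapshots across runs, the `name=value` pairs in the first log line can be extracted with a regular expression. The sketch below also notes that the logged `loss` equals `loss_video + loss_action` up to rounding in the last digit; this is an observation about these numbers, not a confirmed statement of the training loss formula. The helper name `parse_metrics` is illustrative.

```python
import re

LOG_LINE = (
    "[val_id step 100000] 20/98741 samples: "
    "loss=0.143614, loss_video=0.101261, loss_action=0.042352"
)


def parse_metrics(line):
    """Extract name=value pairs whose value is a decimal number."""
    pairs = re.findall(r"(\w+)=([-+]?\d+(?:\.\d+)?)", line)
    return {name: float(value) for name, value in pairs}


metrics = parse_metrics(LOG_LINE)
# Difference between the logged total and the sum of the two logged components.
residual = abs(metrics["loss"] - (metrics["loss_video"] + metrics["loss_action"]))
print(metrics)
print(f"residual = {residual:.1e}")
```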
Notes
This checkpoint was trained for research use in the local VAM codebase. It expects the matching model architecture and RoboTwin observation/action conventions used by src/vam/examples/wanvideo/human2robot/train_video_action.py.
Base model
Wan-AI/Wan2.2-TI2V-5B