RoboTwin ICL v3 ARX-X5 VAM TI2V-5B OpenWAM EEF No-Ref Train20 100K
This repository contains a 100k-step VAM checkpoint trained on the RoboTwin ICL-paired v3 dataset for the ARX-X5 target robot.
This is the EEF no-reference-video variant: cross-embodiment reference video conditioning was disabled during training. It is intended as the EEF action-space no-reference baseline for comparison against paired-reference checkpoints, and it keeps the same model backbone, action stream, target robot, train20 task split, and 100k training horizon as those checkpoints.
Files
step-100000.safetensors: joint VAM checkpoint at 100k steps
model_config.json: architecture and data/preprocessing contract
training_config.yaml: training configuration snapshot
action_stats.npy: EEF/action normalization stats used by training
training_log_node0.txt: rank/node 0 training log
training_log_node1.txt: rank/node 1 training log
README.md: this model card
Checkpoint SHA256:
5af643a4e2a460ef5abd1267bd1c168f11bee8d5c90d20b75e91a34e5c350480
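A download can be checked against the digest above before loading. The following is a minimal sketch using only the Python standard library; the function names are illustrative and are not part of the VAM codebase.

```python
import hashlib

# Digest published in this model card for step-100000.safetensors.
EXPECTED_SHA256 = "5af643a4e2a460ef5abd1267bd1c168f11bee8d5c90d20b75e91a34e5c350480"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected=EXPECTED_SHA256):
    """Return True if the file at `path` hashes to the expected digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `verify_checkpoint("step-100000.safetensors")` should return `True` for an intact copy of the checkpoint.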
The checkpoint contains both fine-tuned Wan video DiT weights and action_dit.* ActionMoT weights. The Wan2.2-TI2V-5B base assets are not bundled.
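To see how the tensors divide between the two streams, the tensor names can be partitioned on the `action_dit.` prefix mentioned above. This is a rough sketch: the helper names are hypothetical, `checkpoint_keys` assumes the `safetensors` package (and PyTorch) is installed, and nothing is assumed about the naming of the video DiT tensors beyond their not starting with `action_dit.`.

```python
from collections import Counter

ACTION_PREFIX = "action_dit."  # prefix named in this model card


def split_by_stream(keys, action_prefix=ACTION_PREFIX):
    """Partition tensor names into (action-stream names, all other names)."""
    action, other = [], []
    for key in keys:
        (action if key.startswith(action_prefix) else other).append(key)
    return action, other


def top_level_counts(keys):
    """Count tensors per top-level name component (text before the first dot)."""
    return Counter(key.split(".", 1)[0] for key in keys)


def checkpoint_keys(path):
    """List tensor names in a safetensors file without reading tensor data."""
    from safetensors import safe_open  # imported lazily; third-party dependency

    with safe_open(path, framework="pt") as f:
        return list(f.keys())
```

A typical use would be `action, other = split_by_stream(checkpoint_keys("step-100000.safetensors"))`, followed by `top_level_counts(other)` to get an overview of the remaining weight groups.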
Training Summary
run directory: src/vam/models/train/vam_icl/paired_v3_alltasks_mv_mot_ti2v5b_16g_100k_v2_no_ref_openwam_eef_fastwam_warmstart_relaunch_20260521-234638
wandb run: run-20260522_081526-lj5uhtfc
wandb run name: paired_v3_train_mv_mot_ti2v5b_16g_100k_mask_v2_openwam_0522_0812
checkpoint: step-100000.safetensors
dataset: RoboTwin ICL-paired v3
target robot: arx-x5
task split: 20 train tasks
episodes/task: 150 train episodes
backbone: Wan2.2-TI2V-5B
action stream: 30-layer ActionMoT, OpenWAM/FastWAM warm-start style
reference mode: disabled
multiview: enabled, RoboTwin head + left wrist + right wrist layout
mask variant: v2
Action And Proprio Contract
action_space: eef
proprio_space: action
action_dim: 16
proprio_dim: 16
action_format: absolute
The model predicts absolute 16-D EEF/action-space targets. Use the included action_stats.npy for normalization/denormalization.
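The exact normalization scheme and the internal layout of action_stats.npy are not documented in this card. The sketch below assumes per-dimension mean/std (z-score) normalization purely for illustration; if the training code uses min/max or quantile scaling instead, the formulas must be changed to match. The function names and the `eps` guard are hypothetical.

```python
import numpy as np

ACTION_DIM = 16  # action_dim from this model card


def normalize(actions, mean, std, eps=1e-8):
    """Map raw actions to normalized space (ASSUMED z-score scheme)."""
    actions = np.asarray(actions, dtype=np.float32)
    assert actions.shape[-1] == ACTION_DIM, "last axis must be the 16-D action"
    return (actions - mean) / (std + eps)


def denormalize(normalized, mean, std, eps=1e-8):
    """Invert normalize(): map model outputs back to raw action units."""
    normalized = np.asarray(normalized, dtype=np.float32)
    assert normalized.shape[-1] == ACTION_DIM, "last axis must be the 16-D action"
    return normalized * (std + eps) + mean
```

Before relying on this, inspect the stats file (for example with `np.load("action_stats.npy", allow_pickle=True)`) and confirm against the training code which statistics it stores and how they are applied.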
Temporal And Visual Contract
raw action window: 33 frames
action horizon: 32 actions
action_video_freq_ratio: 4
target video frames: 9 frames sampled at raw indices [0, 4, ..., 32]
resolution: 384 x 320 (height x width)
resize mode: stretch
multiview layout: head camera on top, left/right wrist cameras below
full reference video: disabled for training
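The temporal numbers above fit together arithmetically. The sketch below reproduces the video frame indices from the raw window and the frequency ratio; reading the 32-action horizon as "one action per transition between the 33 raw frames" is an interpretation consistent with the numbers, not something this card states. The function names are illustrative.

```python
RAW_WINDOW = 33              # raw action window, in frames
ACTION_VIDEO_FREQ_RATIO = 4  # raw steps per target video frame


def video_frame_indices(raw_window=RAW_WINDOW, ratio=ACTION_VIDEO_FREQ_RATIO):
    """Raw-frame indices of the target video frames: 0, ratio, 2*ratio, ..."""
    return list(range(0, raw_window, ratio))


def action_horizon(raw_window=RAW_WINDOW):
    """Number of actions, assuming one action per transition between raw frames."""
    return raw_window - 1


indices = video_frame_indices()
# 9 video frames span 8 intervals of 4 raw steps each, i.e. 32 actions.
assert (len(indices) - 1) * ACTION_VIDEO_FREQ_RATIO == action_horizon()
```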
For closed-loop eval, use the matching local VAM code path with:
action_dim=16
proprio_dim=16
num_frames=9
action_horizon=32
height=384
width=320
multiview=true
action_space=eef
proprio_space=action
disable_reference_video=true
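One way to guard against a mismatched eval setup is to compare these settings with whatever configuration the eval script actually loads. The sketch below is hypothetical: it assumes the loaded configuration is a flat dictionary using the same key names as the list above, which this card does not confirm for model_config.json or any other file.

```python
import json

# Settings listed in this model card for closed-loop eval.
EVAL_SETTINGS = {
    "action_dim": 16,
    "proprio_dim": 16,
    "num_frames": 9,
    "action_horizon": 32,
    "height": 384,
    "width": 320,
    "multiview": True,
    "action_space": "eef",
    "proprio_space": "action",
    "disable_reference_video": True,
}


def find_mismatches(expected, actual):
    """Return {key: (expected, actual)} for keys present in both but with different values."""
    return {k: (v, actual[k]) for k, v in expected.items()
            if k in actual and actual[k] != v}


def missing_keys(expected, actual):
    """Return the expected keys that the loaded configuration does not define."""
    return sorted(k for k in expected if k not in actual)


def load_json(path):
    """Load a JSON configuration file into a dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

An empty result from both `find_mismatches(EVAL_SETTINGS, cfg)` and `missing_keys(EVAL_SETTINGS, cfg)` would indicate that the loaded configuration `cfg` agrees with the contract on every listed key.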
Validation Snapshot
Last validation lines observed for this checkpoint:
[val_id step 100000] 20/98741 samples: loss=0.110698, loss_video=0.092703, loss_action=0.017994
[val_id step 100000] task=grab roller action_MSE=0.000646 action_MAE=0.018203 video_MSE=2014.58 PSNR=19.24 SSIM=0.7889 LPIPS=0.2337
These numbers are a training-time validation snapshot, not a full deployment evaluation.
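The `name=value` pairs in log lines like these can be extracted with a small regular expression, for example when comparing checkpoints. This is a sketch based only on the two lines shown above; other lines in the training logs may use different formats.

```python
import re

# Matches name=<number>; non-numeric values such as "task=grab roller" are skipped.
METRIC_RE = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?)")


def parse_metrics(line):
    """Return a {metric name: float value} dictionary for one log line."""
    return {name: float(value) for name, value in METRIC_RE.findall(line)}


summary = parse_metrics(
    "[val_id step 100000] 20/98741 samples: "
    "loss=0.110698, loss_video=0.092703, loss_action=0.017994"
)
# The two components add up to the total within rounding of the logged digits.
residual = abs(summary["loss_video"] + summary["loss_action"] - summary["loss"])
```

Here `residual` is about 1e-6, which is consistent with the logged total being the plain sum of the video and action losses for this run.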
Notes
This checkpoint was trained for research use in the local VAM codebase. It expects the matching model architecture and RoboTwin observation/action conventions used by src/vam/examples/wanvideo/human2robot/train_video_action.py.
Model tree for knightnemo/robotwin-icl-v3-arx-x5-vam-ti2v5b-openwam-eef-no-ref-train20-100k
Base model: Wan-AI/Wan2.2-TI2V-5B