RoboTwin ICL v3 ARX-X5 VAM TI2V-5B OpenWAM Ref 100K
This repository contains a 100k-step VAM checkpoint trained on the RoboTwin ICL-paired v3 dataset for the ARX-X5 target robot.
This is the reference-video variant: during training, the model was conditioned on a paired cross-embodiment reference video from the same task and seed whenever one was available. It is intended for ICL-style conditioning on human or robot reference videos, anchoring on the current target view, and action rollout prediction.
Files
step-100000.safetensors # joint VAM checkpoint at 100k steps
model_config.json # architecture and data/preprocessing contract
training_log_node0.txt # rank/node 0 training log
training_log_node1.txt # rank/node 1 training log
README.md # this model card
The checkpoint contains both fine-tuned Wan video DiT weights and action_dit.* ActionMoT weights.
The Wan2.2-TI2V-5B base assets are not bundled; obtain them separately from the Wan-AI/Wan2.2-TI2V-5B base model repository.
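Because both weight groups live in one file, it can be useful to confirm the split before loading. The sketch below groups parameter names by the action_dit. prefix mentioned above. The helper names are hypothetical and not part of the VAM codebase, and listing keys through the safetensors library's safe_open is one possible approach that assumes safetensors and PyTorch are installed.

```python
from collections import Counter
from typing import Iterable, List

ACTION_PREFIX = "action_dit."  # prefix named in this model card


def count_weight_groups(keys: Iterable[str]) -> Counter:
    """Count parameter names under the ActionMoT prefix vs. all other names."""
    counts = Counter()
    for name in keys:
        group = "action_dit" if name.startswith(ACTION_PREFIX) else "other"
        counts[group] += 1
    return counts


def list_checkpoint_keys(path: str) -> List[str]:
    """Read tensor names from a safetensors file without loading tensor data."""
    # Imported lazily so the grouping helper works without safetensors installed.
    from safetensors import safe_open

    with safe_open(path, framework="pt") as f:
        return list(f.keys())


if __name__ == "__main__":
    # Hypothetical key names for illustration only; real names may differ.
    demo_keys = ["action_dit.blocks.0.weight", "blocks.0.weight"]
    print(count_weight_groups(demo_keys))
```

With the checkpoint on disk, count_weight_groups(list_checkpoint_keys("step-100000.safetensors")) should report a non-zero count for both groups if the file matches the description above.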
Training Summary
run directory: src/vam/models/train/vam_icl/paired_v3_alltasks_mv_mot_ti2v5b_16g_100k_v2_ref_openwam_fastwam_warmstart
checkpoint: step-100000.safetensors
dataset: RoboTwin ICL-paired v3
target robot: arx-x5
task split: 20 train tasks
episodes/task: 150 train episodes
backbone: Wan2.2-TI2V-5B
action stream: 30-layer ActionMoT, OpenWAM/FastWAM warm-start style
reference mode: enabled, paired cross-embodiment reference video
multiview: enabled, RoboTwin head + left wrist + right wrist layout
Action And Proprio Contract
action_space: joint
proprio_space: joint
action_dim: 14
proprio_dim: 14
action_format: absolute
The 14-D action/state vector follows RoboTwin joint_action/vector:
left arm joints + left gripper + right arm joints + right gripper
The model predicts absolute joint targets, not delta actions.
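The ordering above can be sketched as a small helper that splits one 14-D vector into its four parts. This is an illustration, not code from the training scripts: the per-arm joint count of 6 (so that 6 + 1 + 6 + 1 = 14) is an assumption that fits the 14-D total but is not stated in this card, and split_action is a hypothetical name.

```python
from typing import Dict, List, Sequence

ACTION_DIM = 14  # from the contract above
ARM_JOINTS = 6   # ASSUMPTION: 6 joints per arm, so 6 + 1 + 6 + 1 = 14


def split_action(action: Sequence[float]) -> Dict[str, List[float]]:
    """Split one absolute joint-target vector into arm and gripper parts."""
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(action)}")
    values = list(action)
    right_start = ARM_JOINTS + 1  # left arm joints plus one left gripper value
    return {
        "left_arm": values[:ARM_JOINTS],
        "left_gripper": values[ARM_JOINTS:right_start],
        "right_arm": values[right_start:right_start + ARM_JOINTS],
        "right_gripper": values[right_start + ARM_JOINTS:],
    }
```

Since the outputs are absolute targets, a controller that expects deltas would need the current joint state subtracted first; that conversion is not shown here.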
Temporal And Visual Contract
raw action window: 33 frames
action horizon: 32 actions
action_video_freq_ratio: 4
target video frames: 9 frames sampled at raw indices [0, 4, ..., 32]
resolution: 384 x 320
multiview layout: head camera on top, left/right wrist cameras below
full reference video: enabled
max reference frames: 41
reference subsample factor: 4
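The temporal numbers above are mutually consistent, which the short check below makes explicit. It is a sketch of the arithmetic only: the constant and function names are hypothetical, and reading the 32-action horizon as "raw window minus one" is an interpretation of the listed values rather than something the card states.

```python
RAW_WINDOW = 33              # raw action window, in frames
ACTION_HORIZON = 32          # actions predicted per window
ACTION_VIDEO_FREQ_RATIO = 4  # raw frames per sampled video frame
NUM_VIDEO_FRAMES = 9         # target video frames


def target_frame_indices(raw_window: int = RAW_WINDOW,
                         ratio: int = ACTION_VIDEO_FREQ_RATIO) -> list:
    """Raw-frame indices of the sampled target video frames."""
    return list(range(0, raw_window, ratio))


indices = target_frame_indices()
# Every 4th raw frame from 0 through 32 gives exactly 9 video frames.
assert indices == [0, 4, 8, 12, 16, 20, 24, 28, 32]
assert len(indices) == NUM_VIDEO_FRAMES
# Interpretation: the horizon is one less than the raw window length.
assert ACTION_HORIZON == RAW_WINDOW - 1
assert ACTION_HORIZON // ACTION_VIDEO_FREQ_RATIO + 1 == NUM_VIDEO_FRAMES
```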
For closed-loop eval, use:
action_dim=14
proprio_dim=14
num_frames=9
action_horizon=32
height=384
width=320
multiview=true
action_space=joint
proprio_space=action
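One way to keep these settings in a single place is a plain dictionary with a shape check on each predicted action chunk. The sketch below is illustrative: EVAL_SETTINGS, to_overrides, and check_action_chunk are hypothetical names, and the actual eval entry point and its argument handling are not described in this card. The values are copied verbatim from the list above.

```python
EVAL_SETTINGS = {
    "action_dim": 14,
    "proprio_dim": 14,
    "num_frames": 9,
    "action_horizon": 32,
    "height": 384,
    "width": 320,
    "multiview": True,
    "action_space": "joint",
    "proprio_space": "action",
}


def to_overrides(settings: dict) -> list:
    """Render settings as key=value strings, with booleans lowercased."""
    out = []
    for key, value in settings.items():
        text = str(value).lower() if isinstance(value, bool) else str(value)
        out.append(f"{key}={text}")
    return out


def check_action_chunk(chunk, settings: dict = EVAL_SETTINGS) -> None:
    """Raise ValueError unless chunk has shape (action_horizon, action_dim)."""
    if len(chunk) != settings["action_horizon"]:
        raise ValueError(f"expected {settings['action_horizon']} steps, got {len(chunk)}")
    for step in chunk:
        if len(step) != settings["action_dim"]:
            raise ValueError(f"expected {settings['action_dim']} values per step, got {len(step)}")
```

Calling check_action_chunk on each model output before sending it to the robot or simulator catches a horizon or dimension mismatch early.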
Notes
This checkpoint was trained for research use in the local VAM codebase. It expects the matching model architecture, action normalization stats, and RoboTwin observation/action conventions used by the training scripts in src/vam/examples/wanvideo/human2robot.