RoboTwin ICL v3 ARX-X5 VAM TI2V-5B OpenWAM Ref 100K

This repository contains a 100k-step VAM checkpoint trained on the RoboTwin ICL-paired v3 dataset for the ARX-X5 target robot.

This is the reference-video variant: during training, the model was given paired cross-embodiment reference videos from the same task and seed whenever they were available. It is intended for ICL-style conditioning on human or robot reference videos, anchoring on the current target view, and action rollout prediction.

Files

step-100000.safetensors   # joint VAM checkpoint at 100k steps
model_config.json         # architecture and data/preprocessing contract
training_log_node0.txt    # rank/node 0 training log
training_log_node1.txt    # rank/node 1 training log
README.md                 # this model card

The checkpoint contains both fine-tuned Wan video DiT weights and action_dit.* ActionMoT weights. The Wan2.2-TI2V-5B base assets are not bundled.
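
Because both weight groups live in one file, it can help to separate them by key prefix before loading. The following is a minimal sketch, not the codebase's loader: it assumes every tensor outside the action_dit. prefix belongs to the video DiT, and the key names in the example are hypothetical.

```python
from typing import Dict, Iterable, List

ACTION_PREFIX = "action_dit."  # prefix named in this model card


def split_checkpoint_keys(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Partition tensor names into ActionMoT keys and everything else.

    Assumption: any key that does not start with ACTION_PREFIX is a
    video DiT weight. The real checkpoint may hold other groups.
    """
    groups: Dict[str, List[str]] = {"action": [], "video": []}
    for key in keys:
        name = "action" if key.startswith(ACTION_PREFIX) else "video"
        groups[name].append(key)
    return groups


# Hypothetical tensor names, for illustration only.
example_keys = [
    "blocks.0.attn.q.weight",
    "action_dit.blocks.0.attn.q.weight",
    "action_dit.head.bias",
]
groups = split_checkpoint_keys(example_keys)
print(len(groups["video"]), len(groups["action"]))
```

The real key names can be listed without loading any tensors by opening the file with safe_open from the safetensors package and iterating over its keys().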

Training Summary

run directory: src/vam/models/train/vam_icl/paired_v3_alltasks_mv_mot_ti2v5b_16g_100k_v2_ref_openwam_fastwam_warmstart
checkpoint: step-100000.safetensors
dataset: RoboTwin ICL-paired v3
target robot: arx-x5
task split: 20 train tasks
episodes/task: 150 train episodes
backbone: Wan2.2-TI2V-5B
action stream: 30-layer ActionMoT, OpenWAM/FastWAM warm-start style
reference mode: enabled, paired cross-embodiment reference video
multiview: enabled, RoboTwin head + left wrist + right wrist layout

Action And Proprio Contract

action_space: joint
proprio_space: joint
action_dim: 14
proprio_dim: 14
action_format: absolute

The 14-D action/state vector follows RoboTwin joint_action/vector:

left arm joints + left gripper + right arm joints + right gripper

The model predicts absolute joint targets, not delta actions.
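
The layout above can be unpacked as in the sketch below. The split of 6 arm joints plus 1 gripper value per side is an assumption that makes the sizes add up to 14; the card only states the ordering, so check it against the dataset before relying on it. The function name is hypothetical.

```python
from typing import Dict, List, Sequence, Union

ARM_JOINTS = 6  # assumption: 6 joints per arm, so 2 * (6 + 1) = 14
VECTOR_DIM = 2 * (ARM_JOINTS + 1)


def split_joint_vector(vec: Sequence[float]) -> Dict[str, Union[float, List[float]]]:
    """Split a 14-D absolute joint vector into per-arm parts.

    Order follows the card: left arm, left gripper, right arm, right gripper.
    """
    if len(vec) != VECTOR_DIM:
        raise ValueError(f"expected {VECTOR_DIM} values, got {len(vec)}")
    right_start = ARM_JOINTS + 1  # first index of the right arm
    return {
        "left_arm": list(vec[:ARM_JOINTS]),
        "left_gripper": vec[ARM_JOINTS],
        "right_arm": list(vec[right_start:right_start + ARM_JOINTS]),
        "right_gripper": vec[right_start + ARM_JOINTS],
    }


# Use the index of each slot as its value to make the layout visible.
parts = split_joint_vector(list(range(14)))
print(parts["left_gripper"], parts["right_gripper"])
```

Since the targets are absolute, each predicted vector can be sent as a joint position command as-is, without adding it to the current state.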

Temporal And Visual Contract

raw action window: 33 frames
action horizon: 32 actions
action_video_freq_ratio: 4
target video frames: 9 frames sampled at raw indices [0, 4, ..., 32]
resolution: 384 x 320 (height x width)
multiview layout: head camera on top, left/right wrist cameras below
full reference video: enabled
max reference frames: 41
reference subsample factor: 4
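
The relationship between the raw window, the frequency ratio, and the target video frames can be checked with a few lines of arithmetic. This is a sketch of the numbers stated above, not the training dataloader. Treating the 32 actions as one per step between consecutive raw frames is an assumption, and the function name is hypothetical.

```python
from typing import List

RAW_WINDOW = 33              # raw frames per sample, from the card
ACTION_VIDEO_FREQ_RATIO = 4  # raw frames per video frame, from the card


def video_frame_indices(raw_window: int, ratio: int) -> List[int]:
    """Raw-frame indices kept as video frames: every `ratio`-th frame."""
    return list(range(0, raw_window, ratio))


indices = video_frame_indices(RAW_WINDOW, ACTION_VIDEO_FREQ_RATIO)
# Assumption: one action per transition between consecutive raw frames.
action_horizon = RAW_WINDOW - 1

print(indices)
print(len(indices), action_horizon)
```

With these values the sampled indices run from 0 to 32 in steps of 4, which gives the 9 target frames and the 32-action horizon listed above.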

For closed-loop eval, use:

action_dim=14
proprio_dim=14
num_frames=9
action_horizon=32
height=384
width=320
multiview=true
action_space=joint
proprio_space=joint
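
Before launching an eval run, a subset of these settings can be sanity-checked against the contract in this card. The sketch below is not part of the VAM codebase: the dictionary form and the check_eval_config helper are hypothetical, and the temporal check reuses action_video_freq_ratio=4 from the section above.

```python
from typing import Any, Dict, List

# Values copied from this model card.
EXPECTED: Dict[str, Any] = {
    "action_dim": 14,
    "proprio_dim": 14,
    "num_frames": 9,
    "action_horizon": 32,
    "height": 384,
    "width": 320,
    "multiview": True,
    "action_space": "joint",
}
ACTION_VIDEO_FREQ_RATIO = 4


def check_eval_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means consistent."""
    problems = [
        f"{key}: expected {want!r}, got {cfg.get(key)!r}"
        for key, want in EXPECTED.items()
        if cfg.get(key) != want
    ]
    # Temporal consistency: (video frames - 1) * ratio == action horizon.
    if "num_frames" in cfg and "action_horizon" in cfg:
        if (cfg["num_frames"] - 1) * ACTION_VIDEO_FREQ_RATIO != cfg["action_horizon"]:
            problems.append("num_frames and action_horizon disagree with the frequency ratio")
    return problems


ok = check_eval_config(dict(EXPECTED))
bad = check_eval_config(dict(EXPECTED, num_frames=8))
print(ok)
print(bad)
```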

Notes

This checkpoint was trained for research use in the local VAM codebase. It expects the matching model architecture, action normalization stats, and RoboTwin observation/action conventions used by the training scripts in src/vam/examples/wanvideo/human2robot.

URL: https://huggingface.co/knightnemo/robotwin-icl-v3-arx-x5-vam-ti2v5b-openwam-ref-100k