Wuji Pick Place Vam Ti2V5B 30L Robotwin Ref Openwam 16G Mask V3
This repository contains one Wuji pick-and-place VAM checkpoint from the May 26, 2026 OpenWAM/RobotWin-reference training run.
Identity
repo_id: knightnemo/wuji-pick-place-vam-ti2v5b-30l-robotwin-ref-openwam-16g-mask-v3
wandb project: wuji_pick_place
wandb run id: ge0f4qe7
wandb run name: wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_mask_v3_0526_2127
local training dir: /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_mask_v3
checkpoint: step-20000.safetensors
checkpoint size: 12042172757 bytes
base model: Wan-AI/Wan2.2-TI2V-5B
action expert style: openwam
mask variant: v3
reference frames: 57
reference dropout: 0.1
action/proprio dim: 54 / 54
The checkpoint is a joint model: step-20000.safetensors contains the fine-tuned
video DiT weights plus the action_dit.* action-stream weights. The
Wan2.2-TI2V-5B base model is not included.
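Because one file holds both groups of weights, it can be useful to separate them by key prefix before loading. The following is a minimal sketch, assuming that action-stream tensors are exactly those whose names start with `action_dit.` (the prefix named above) and that every other tensor belongs to the fine-tuned video DiT. The helper names `split_checkpoint_keys` and `list_checkpoint_keys` are hypothetical and not part of this repository.

```python
ACTION_PREFIX = "action_dit."  # prefix named in this model card


def split_checkpoint_keys(keys):
    """Partition tensor names into action-stream and remaining (video DiT) groups."""
    groups = {"action": [], "video": []}
    for name in keys:
        # Assumption: anything without the action prefix is a video DiT weight.
        group = "action" if name.startswith(ACTION_PREFIX) else "video"
        groups[group].append(name)
    return groups


def list_checkpoint_keys(path):
    """List tensor names without reading the ~12 GB of weights into memory."""
    # Imported lazily so the splitting helper works without safetensors installed.
    from safetensors import safe_open

    # framework="pt" assumes PyTorch is available in the environment.
    with safe_open(path, framework="pt") as f:
        return list(f.keys())
```

Calling `split_checkpoint_keys(list_checkpoint_keys("step-20000.safetensors"))` would then give the two name lists, which can be counted or inspected before deciding how to load them.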
Files
step-20000.safetensors: final 20k-step checkpoint
model_config.json: compact machine-readable configuration and metrics
training_config.yaml: full W&B training config snapshot
wandb-summary.json: final scalar metrics exported by W&B
training_log_node0.txt: node-0 training log
training_log_node1.txt: node-1 training log
README.md: this model card
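The files listed above can be fetched individually, which avoids pulling the 12 GB checkpoint when only the configuration or metrics are needed. This is a sketch that assumes the `huggingface_hub` package is installed and that the repository is accessible from your account; the `fetch` helper is a hypothetical wrapper around `hf_hub_download`.

```python
REPO_ID = "knightnemo/wuji-pick-place-vam-ti2v5b-30l-robotwin-ref-openwam-16g-mask-v3"
CHECKPOINT = "step-20000.safetensors"
SMALL_FILES = ["model_config.json", "training_config.yaml", "wandb-summary.json"]


def fetch(filename, repo_id=REPO_ID):
    """Download one file from the repository and return its local cache path."""
    # Imported lazily so the constants above are usable without the package.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


def fetch_metadata():
    """Download only the small configuration and metrics files."""
    return {name: fetch(name) for name in SMALL_FILES}
```

Call `fetch(CHECKPOINT)` only when the weights themselves are required. The Wan2.2-TI2V-5B base model has to be obtained separately.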
Final Step Metrics
These are the scalar values in wandb-summary.json at step=20000.
| Metric | Value |
|---|---|
| val/loss | 0.20223 |
| val/loss_action | 0.11154 |
| val/loss_video | 0.0906898 |
| val/action_mse | 0.00743866 |
| val/action_mae | 0.055161 |
| val/video_mse | 640.378 |
| val/video_psnr | 21.9358 |
| val/video_ssim | 0.784399 |
| val/video_lpips | 0.0869703 |
| train/loss | 0.0870087 |
| train/loss_action | 0.00918485 |
| train/loss_video | 0.0778239 |
| train/grad_norm | 1.53125 |
| _runtime | 32738.4 |
| _step | 20000 |
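The aggregate losses above are numerically consistent with a weighted sum of the video and action terms. The sketch below checks this, assuming the total is `lambda_video * loss_video + lambda_action * loss_action` with both weights equal to 1 as recorded in the training configuration. That form is inferred from the reported numbers and is not confirmed against the training code.

```python
import math

# Weights as recorded in the training configuration of this card.
LAMBDA_VIDEO = 1.0
LAMBDA_ACTION = 1.0


def total_loss(loss_video, loss_action,
               lambda_video=LAMBDA_VIDEO, lambda_action=LAMBDA_ACTION):
    """Assumed aggregate: weighted sum of the video and action losses."""
    return lambda_video * loss_video + lambda_action * loss_action


# Values copied from the final-step metrics table.
val_total = total_loss(loss_video=0.0906898, loss_action=0.11154)
train_total = total_loss(loss_video=0.0778239, loss_action=0.00918485)

# Both reconstructed totals agree with the reported values to rounding.
assert math.isclose(val_total, 0.20223, abs_tol=1e-5)
assert math.isclose(train_total, 0.0870087, abs_tol=1e-6)
```

Under this reading, the validation action loss (0.11154) is roughly twelve times the training action loss (0.00918485), while the two video losses are much closer to each other.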
Saved Validation Losses
Only step-20000.safetensors is uploaded here. Earlier local checkpoints were
saved every 2500 steps and are listed for provenance.
Best saved checkpoint by aggregate validation loss:
step 15000: loss=0.196729, loss_video=0.081665, loss_action=0.115064
| Step | Loss | Video Loss | Action Loss |
|---|---|---|---|
| 2500 | 0.288992 | 0.140088 | 0.148904 |
| 5000 | 0.196988 | 0.077994 | 0.118994 |
| 7500 | 0.250991 | 0.099323 | 0.151668 |
| 10000 | 0.330028 | 0.118547 | 0.211480 |
| 12500 | 0.321415 | 0.121587 | 0.199828 |
| 15000 | 0.196729 | 0.081665 | 0.115064 |
| 17500 | 0.333581 | 0.099311 | 0.234270 |
| 20000 | 0.202230 | 0.090690 | 0.111540 |
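The best checkpoint depends on which loss is used for selection. The following sketch picks the minimum of each column from the table above; the `best_step` helper is hypothetical and the values are copied from the table.

```python
# step -> (loss, video loss, action loss), copied from the table above.
VAL_LOSSES = {
    2500: (0.288992, 0.140088, 0.148904),
    5000: (0.196988, 0.077994, 0.118994),
    7500: (0.250991, 0.099323, 0.151668),
    10000: (0.330028, 0.118547, 0.211480),
    12500: (0.321415, 0.121587, 0.199828),
    15000: (0.196729, 0.081665, 0.115064),
    17500: (0.333581, 0.099311, 0.234270),
    20000: (0.202230, 0.090690, 0.111540),
}
COLUMNS = {"loss": 0, "video": 1, "action": 2}


def best_step(table, column="loss"):
    """Return the step with the lowest value in the chosen loss column."""
    index = COLUMNS[column]
    return min(table, key=lambda step: table[step][index])


print(best_step(VAL_LOSSES, "loss"))    # 15000
print(best_step(VAL_LOSSES, "video"))   # 5000
print(best_step(VAL_LOSSES, "action"))  # 20000
```

The uploaded step-20000 checkpoint therefore has the lowest saved action loss, while step 15000 has the lowest aggregate loss and step 5000 the lowest video loss.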
Training Configuration
| Key | Value |
|---|---|
| dataset_type | wuji_pick_place |
| wuji_robot_dataset_root | None |
| variant | clean_50 |
| height | 384 |
| width | 320 |
| num_frames | 33 |
| action_video_freq_ratio | 4 |
| action_horizon | None |
| action_dim | 54 |
| proprio_dim | 54 |
| action_format | absolute |
| action_space | joint |
| proprio_space | action |
| action_pad_mode | last |
| action_expert_style | openwam |
| action_mot_backbone_pretrained_path | /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/pretrained/ActionMoT_openwam_linear_interp_Wan22_alphascale_1024hdim.pt |
| mask_variant | v3 |
| mask_tail_padding_loss | True |
| full_reference_video | True |
| max_ref_frames | 57 |
| reference_dropout | 0.1 |
| bridge_exclude_full_ref | True |
| extra_inputs | vace_reference_image,action_trajectory |
| target_camera | head_camera |
| reference_camera | head_camera |
| resize_mode | stretch |
| backbone | ti2v |
| model_paths | ["models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00001-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00002-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00003-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth","models/Wan-AI/Wan2.2-TI2V-5B/Wan2.2_VAE.pth"] |
| tokenizer_path | models/Wan-AI/Wan2.2-TI2V-5B/google/umt5-xxl |
| trainable_models | dit |
| learning_rate | 5e-05 |
| action_lr | None |
| weight_decay | 0.01 |
| warmup_steps | 500 |
| max_steps | 20000 |
| num_epochs | 1 |
| batch_size | 1 |
| gradient_accumulation_steps | 1 |
| dataset_repeat | 1 |
| dataset_num_workers | 8 |
| use_gradient_checkpointing | True |
| save_steps | 2500 |
| val_steps | 500 |
| video_log_steps | 2500 |
| max_val_samples | 20 |
| lambda_video | 1 |
| lambda_action | 1 |
| video_dim | 3072 |
| action_dit_dim | 1024 |
| action_dit_ffn_dim | 4096 |
| action_dit_num_heads | 24 |
| action_dit_num_layers | 30 |
| proprio_dropout | 0.1 |
| window_stride | 1 |
| val_ratio | 0.1 |
| output_path | /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_mask_v3 |
| wandb_project | wuji_pick_place |
| wandb_run_name | wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_mask_v3_0526_2127 |
Input/Output Contract
Expected inputs:
prompt: Pick up the ball with the left hand and place it in the basket.
target camera: head_camera
reference camera: head_camera
target video frames: 33
full reference frames: 57
image resolution: 384 x 320 (height x width)
action/proprio dim: 54 / 54
Expected outputs:
robot-view target video rollout
54-D absolute robot action targets
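The contract above can be checked before running inference. This is a minimal sketch under stated assumptions: videos are laid out as `(frames, height, width, channels)` with 3 channels, and the last axis of an action or proprio array has size 54. The card does not specify tensor layouts, so the layout, the `Contract` class, and both helper functions are hypothetical. The sketch also checks for exactly 57 reference frames, although `max_ref_frames` in the configuration suggests 57 may be an upper bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """Sizes taken from this model card."""
    target_frames: int = 33
    reference_frames: int = 57
    height: int = 384
    width: int = 320
    action_dim: int = 54
    proprio_dim: int = 54


def check_video_shape(shape, frames, contract=Contract()):
    """Return a list of problems for a (T, H, W, C) video shape (assumed layout)."""
    if len(shape) != 4:
        return [f"expected 4 dimensions (T, H, W, C), got {len(shape)}"]
    problems = []
    t, h, w, c = shape
    if t != frames:
        problems.append(f"expected {frames} frames, got {t}")
    if (h, w) != (contract.height, contract.width):
        problems.append(f"expected {contract.height}x{contract.width} (HxW), got {h}x{w}")
    if c != 3:
        problems.append(f"expected 3 channels, got {c}")
    return problems


def check_vector_shape(shape, dim):
    """Return a list of problems for an action or proprio array shape."""
    if not shape or shape[-1] != dim:
        return [f"expected last dimension {dim}, got shape {tuple(shape)}"]
    return []
```

For example, `check_video_shape((57, 384, 320, 3), Contract().reference_frames)` returns an empty list, while swapping height and width produces a resolution error.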
Masking Note
This run uses mask_variant=v3 with full cross-embodiment reference video
conditioning and records bridge_exclude_full_ref=True.