URL: https://huggingface.co/knightnemo/wuji-pick-place-vam-ti2v5b-30l-robotwin-ref-openwam-16g-mask-v3


Wuji Pick Place Vam Ti2V5B 30L Robotwin Ref Openwam 16G Mask V3

This repository contains one Wuji pick-and-place VAM checkpoint from the May 26, 2026 OpenWAM/RobotWin-reference training run.

Identity

repo_id: knightnemo/wuji-pick-place-vam-ti2v5b-30l-robotwin-ref-openwam-16g-mask-v3
wandb project: wuji_pick_place
wandb run id: ge0f4qe7
wandb run name: wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_mask_v3_0526_2127
local training dir: /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_mask_v3
checkpoint: step-20000.safetensors
checkpoint size: 12042172757 bytes
base model: Wan-AI/Wan2.2-TI2V-5B
action expert style: openwam
mask variant: v3
reference frames: 57
reference dropout: 0.1
action/proprio dim: 54 / 54

The checkpoint is a joint model: step-20000.safetensors contains the fine-tuned video DiT weights plus the action_dit.* action-stream weights. The Wan2.2-TI2V-5B base model is not included.
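Because both weight groups live in one file, a loader has to separate them by key prefix. The following is a minimal sketch that assumes only what is stated above (action-stream keys start with `action_dit.`, everything else belongs to the video DiT). The helper names are hypothetical and not part of the training code; `safe_open` is the standard `safetensors` reader, and `framework="pt"` assumes PyTorch is installed.

```python
from typing import Dict, Iterable, List

ACTION_PREFIX = "action_dit."  # prefix named in this model card


def split_checkpoint_keys(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Group tensor names into action-stream keys and all remaining keys."""
    groups: Dict[str, List[str]] = {"action_dit": [], "video_dit": []}
    for key in keys:
        name = "action_dit" if key.startswith(ACTION_PREFIX) else "video_dit"
        groups[name].append(key)
    return groups


def list_checkpoint_keys(path: str) -> List[str]:
    """Read tensor names from a safetensors file without loading the weights."""
    from safetensors import safe_open  # lazy import; needs the `safetensors` package

    with safe_open(path, framework="pt") as f:
        return list(f.keys())
```

Calling `split_checkpoint_keys(list_checkpoint_keys("step-20000.safetensors"))` would then report how many tensors fall into each group before any weights are loaded.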

Files

step-20000.safetensors final 20k-step checkpoint
model_config.json compact machine-readable configuration and metrics
training_config.yaml full W&B training config snapshot
wandb-summary.json final scalar metrics exported by W&B
training_log_node0.txt node-0 training log
training_log_node1.txt node-1 training log
README.md this model card
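The small metadata files can be fetched without pulling the checkpoint itself. This sketch uses `hf_hub_download` from `huggingface_hub`; the constant and function names are illustrative, and the checkpoint size is the byte count listed under Identity.

```python
from typing import Dict, List

REPO_ID = "knightnemo/wuji-pick-place-vam-ti2v5b-30l-robotwin-ref-openwam-16g-mask-v3"
CHECKPOINT = "step-20000.safetensors"
CHECKPOINT_BYTES = 12042172757  # from the Identity section
SMALL_FILES: List[str] = [
    "model_config.json",
    "training_config.yaml",
    "wandb-summary.json",
]


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


def fetch(filenames: List[str]) -> Dict[str, str]:
    """Download the named files from the repo and return their local paths."""
    from huggingface_hub import hf_hub_download  # lazy import; needs network access

    return {name: hf_hub_download(repo_id=REPO_ID, filename=name) for name in filenames}
```

`fetch(SMALL_FILES)` retrieves only the configuration and metrics; `fetch([CHECKPOINT])` downloads roughly 11.2 GiB, so it is worth checking free disk space first.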

Final Step Metrics

These are the scalar values in wandb-summary.json at step=20000.

Metric Value
val/loss 0.20223
val/loss_action 0.11154
val/loss_video 0.0906898
val/action_mse 0.00743866
val/action_mae 0.055161
val/video_mse 640.378
val/video_psnr 21.9358
val/video_ssim 0.784399
val/video_lpips 0.0869703
train/loss 0.0870087
train/loss_action 0.00918485
train/loss_video 0.0778239
train/grad_norm 1.53125
_runtime 32738.4
_step 20000
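The totals above are consistent with a weighted sum of the two component losses using `lambda_video=1` and `lambda_action=1` from the training configuration. The check below is a sketch of that reading, not the training code's loss function. It also converts `_runtime` to hours on the assumption that W&B reports it in seconds.

```python
SUMMARY = {
    "val/loss": 0.20223,
    "val/loss_action": 0.11154,
    "val/loss_video": 0.0906898,
    "train/loss": 0.0870087,
    "train/loss_action": 0.00918485,
    "train/loss_video": 0.0778239,
    "_runtime": 32738.4,
    "_step": 20000,
}
LAMBDA_VIDEO = 1.0   # lambda_video in the training configuration
LAMBDA_ACTION = 1.0  # lambda_action in the training configuration


def weighted_loss(split: str) -> float:
    """Recombine the component losses for 'val' or 'train'."""
    return (LAMBDA_VIDEO * SUMMARY[f"{split}/loss_video"]
            + LAMBDA_ACTION * SUMMARY[f"{split}/loss_action"])


def is_consistent(split: str, tol: float = 1e-5) -> bool:
    """True if the reported total matches the weighted sum within rounding."""
    return abs(weighted_loss(split) - SUMMARY[f"{split}/loss"]) <= tol


# Assumption: _runtime is wall-clock seconds since the run started.
runtime_hours = SUMMARY["_runtime"] / 3600
seconds_per_step = SUMMARY["_runtime"] / SUMMARY["_step"]
```

Under that assumption the run took about 9.1 hours, or roughly 1.6 seconds per optimizer step including validation and checkpointing overhead.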

Saved Validation Losses

Only step-20000.safetensors is uploaded here. Earlier local checkpoints were saved every 2500 steps and are listed for provenance.

Best saved checkpoint by aggregate validation loss:

step 15000: loss=0.196729, loss_video=0.081665, loss_action=0.115064

Step    Loss      Video Loss  Action Loss
2500    0.288992  0.140088    0.148904
5000    0.196988  0.077994    0.118994
7500    0.250991  0.099323    0.151668
10000   0.330028  0.118547    0.211480
12500   0.321415  0.121587    0.199828
15000   0.196729  0.081665    0.115064
17500   0.333581  0.099311    0.234270
20000   0.202230  0.090690    0.111540
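Which checkpoint counts as "best" depends on the metric. A short sketch that ranks the saved checkpoints listed above by each column; the values are copied from this section and the names are illustrative.

```python
# (step, loss, loss_video, loss_action), copied from the table in this section
CHECKPOINTS = [
    (2500, 0.288992, 0.140088, 0.148904),
    (5000, 0.196988, 0.077994, 0.118994),
    (7500, 0.250991, 0.099323, 0.151668),
    (10000, 0.330028, 0.118547, 0.211480),
    (12500, 0.321415, 0.121587, 0.199828),
    (15000, 0.196729, 0.081665, 0.115064),
    (17500, 0.333581, 0.099311, 0.234270),
    (20000, 0.202230, 0.090690, 0.111540),
]
COLUMN = {"loss": 1, "loss_video": 2, "loss_action": 3}


def best_step(metric: str) -> int:
    """Return the step with the lowest validation value for the given metric."""
    return min(CHECKPOINTS, key=lambda row: row[COLUMN[metric]])[0]
```

By these numbers, step 15000 has the lowest aggregate loss, step 5000 the lowest video loss, and the uploaded step 20000 the lowest action loss.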

Training Configuration

Key Value
dataset_type wuji_pick_place
wuji_robot_dataset_root None
variant clean_50
height 384
width 320
num_frames 33
action_video_freq_ratio 4
action_horizon None
action_dim 54
proprio_dim 54
action_format absolute
action_space joint
proprio_space action
action_pad_mode last
action_expert_style openwam
action_mot_backbone_pretrained_path /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/pretrained/ActionMoT_openwam_linear_interp_Wan22_alphascale_1024hdim.pt
mask_variant v3
mask_tail_padding_loss True
full_reference_video True
max_ref_frames 57
reference_dropout 0.1
bridge_exclude_full_ref True
extra_inputs vace_reference_image,action_trajectory
target_camera head_camera
reference_camera head_camera
resize_mode stretch
backbone ti2v
model_paths ["models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00001-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00002-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00003-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth","models/Wan-AI/Wan2.2-TI2V-5B/Wan2.2_VAE.pth"]
tokenizer_path models/Wan-AI/Wan2.2-TI2V-5B/google/umt5-xxl
trainable_models dit
learning_rate 5e-05
action_lr None
weight_decay 0.01
warmup_steps 500
max_steps 20000
num_epochs 1
batch_size 1
gradient_accumulation_steps 1
dataset_repeat 1
dataset_num_workers 8
use_gradient_checkpointing True
save_steps 2500
val_steps 500
video_log_steps 2500
max_val_samples 20
lambda_video 1
lambda_action 1
video_dim 3072
action_dit_dim 1024
action_dit_ffn_dim 4096
action_dit_num_heads 24
action_dit_num_layers 30
proprio_dropout 0.1
window_stride 1
val_ratio 0.1
output_path /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_mask_v3
wandb_project wuji_pick_place
wandb_run_name wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_mask_v3_0526_2127
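The interval settings above imply a fixed schedule of checkpoints and validation passes. The sketch below assumes each event fires whenever the step count is a multiple of its interval; the saved checkpoints listed in this card (2500 through 20000) agree with that for `save_steps`, but it is an assumption for `val_steps`.

```python
from typing import List

MAX_STEPS = 20000
SAVE_STEPS = 2500
VAL_STEPS = 500


def event_steps(interval: int, max_steps: int = MAX_STEPS) -> List[int]:
    """Steps at which an event fires, assuming it triggers on every multiple."""
    return list(range(interval, max_steps + 1, interval))


save_schedule = event_steps(SAVE_STEPS)  # checkpoint steps
val_schedule = event_steps(VAL_STEPS)    # validation steps (assumed)
```

With these values there are 8 checkpoint steps and 40 validation passes, so only every fifth validation result coincides with a saved checkpoint.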

Input/Output Contract

Expected inputs:

prompt: Pick up the ball with the left hand and place it in the basket.
target camera: head_camera
reference camera: head_camera
target video frames: 33
full reference frames: 57
image resolution: 384 x 320 (height x width)
action/proprio dim: 54 / 54

Expected outputs:

robot-view target video rollout
54-D absolute robot action targets
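A caller can check array shapes against this contract before running inference. The sketch below is hypothetical: the frames-first, channels-last layout `(frames, height, width, channels)` is an assumption, as is treating 57 as an upper bound on reference frames (suggested by the name `max_ref_frames`). The function is not part of the released code.

```python
from typing import List, Sequence

NUM_FRAMES = 33      # target video frames
MAX_REF_FRAMES = 57  # full reference frames
HEIGHT, WIDTH = 384, 320
STATE_DIM = 54       # action and proprio dimension


def check_inputs(target_shape: Sequence[int],
                 reference_shape: Sequence[int],
                 proprio_shape: Sequence[int]) -> List[str]:
    """Return a list of contract violations; an empty list means all checks pass."""
    problems: List[str] = []
    # Assumed layout: (frames, height, width, channels).
    if tuple(target_shape[:3]) != (NUM_FRAMES, HEIGHT, WIDTH):
        problems.append(f"target video should start with {(NUM_FRAMES, HEIGHT, WIDTH)}")
    # Assumption: max_ref_frames is an upper bound on the reference length.
    if reference_shape[0] > MAX_REF_FRAMES:
        problems.append(f"reference video has more than {MAX_REF_FRAMES} frames")
    if proprio_shape[-1] != STATE_DIM:
        problems.append(f"proprio last dimension should be {STATE_DIM}")
    return problems
```

For example, passing a target shape of `(33, 320, 384, 3)` reports a violation because height and width are swapped.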

Masking Note

This run uses mask_variant=v3 with full cross-embodiment reference video conditioning and records bridge_exclude_full_ref=True.
