Wuji Pick Place Vam Ti2V5B 30L Robotwin Ref Openwam 16G Mask V3

This repository contains one Wuji pick-and-place VAM checkpoint from the May 26, 2026 OpenWAM/RobotWin-reference training run.

Identity

repo_id: knightnemo/wuji-pick-place-vam-ti2v5b-30l-robotwin-ref-openwam-16g-mask-v3
wandb project: wuji_pick_place
wandb run id: ge0f4qe7
wandb run name: wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_mask_v3_0526_2127
local training dir: /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_mask_v3
checkpoint: step-20000.safetensors
checkpoint size: 12042172757 bytes
base model: Wan-AI/Wan2.2-TI2V-5B
action expert style: openwam
mask variant: v3
reference frames: 57
reference dropout: 0.1
action/proprio dim: 54 / 54

The checkpoint is a joint model: step-20000.safetensors contains the fine-tuned video DiT weights plus the action_dit.* action-stream weights. The Wan2.2-TI2V-5B base model is not included.
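Because both weight groups live in one file, a loader usually has to separate them by key prefix. The following is a minimal sketch, assuming every action-stream key begins with action_dit. as stated above and that every other key belongs to the video DiT. The helper names split_checkpoint_keys and list_checkpoint_keys are hypothetical and not part of this repository.

```python
from typing import Dict, Iterable, List

ACTION_PREFIX = "action_dit."  # prefix named in this model card


def split_checkpoint_keys(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Partition tensor names into action-stream and video-DiT groups."""
    groups: Dict[str, List[str]] = {"action": [], "video": []}
    for key in keys:
        # Assumption: any key without the action prefix is a video DiT weight.
        target = "action" if key.startswith(ACTION_PREFIX) else "video"
        groups[target].append(key)
    return groups


def list_checkpoint_keys(path: str) -> List[str]:
    """Read tensor names from a safetensors file without loading the tensors."""
    # Imported lazily so the pure-Python helper above has no dependencies.
    from safetensors import safe_open

    with safe_open(path, framework="pt") as f:
        return list(f.keys())
```

With the checkpoint downloaded, split_checkpoint_keys(list_checkpoint_keys("step-20000.safetensors")) would give the two key lists. The Wan2.2-TI2V-5B text encoder and VAE still have to be loaded separately from the base model.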

Files

step-20000.safetensors: final 20k-step checkpoint
model_config.json: compact machine-readable configuration and metrics
training_config.yaml: full W&B training config snapshot
wandb-summary.json: final scalar metrics exported by W&B
training_log_node0.txt: node-0 training log
training_log_node1.txt: node-1 training log
README.md: this model card

Final Step Metrics

These are the scalar values in wandb-summary.json at step=20000. The _runtime entry is W&B's elapsed run time in seconds.

Metric	Value
`val/loss`	0.20223
`val/loss_action`	0.11154
`val/loss_video`	0.0906898
`val/action_mse`	0.00743866
`val/action_mae`	0.055161
`val/video_mse`	640.378
`val/video_psnr`	21.9358
`val/video_ssim`	0.784399
`val/video_lpips`	0.0869703
`train/loss`	0.0870087
`train/loss_action`	0.00918485
`train/loss_video`	0.0778239
`train/grad_norm`	1.53125
`_runtime`	32738.4
`_step`	20000
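The aggregate losses in this table appear to be the plain sum of their video and action parts, which matches lambda_video=1 and lambda_action=1 in the training configuration. The sketch below checks that and derives an average time per step. It assumes wandb-summary.json is a flat mapping with the same keys as the table; the function names are illustrative.

```python
import math
from typing import Mapping


def check_loss_decomposition(summary: Mapping[str, float], split: str,
                             tol: float = 1e-4) -> bool:
    """Return True if `<split>/loss` equals loss_video + loss_action within tol."""
    total = summary[f"{split}/loss"]
    parts = summary[f"{split}/loss_video"] + summary[f"{split}/loss_action"]
    return math.isclose(total, parts, abs_tol=tol)


def seconds_per_step(summary: Mapping[str, float]) -> float:
    """Average wall-clock seconds per step from W&B's `_runtime` and `_step`."""
    return summary["_runtime"] / summary["_step"]


# Values copied from the table above. In practice they could be read with
# json.load(open("wandb-summary.json")), assuming the same flat key layout.
summary = {
    "val/loss": 0.20223,
    "val/loss_action": 0.11154,
    "val/loss_video": 0.0906898,
    "train/loss": 0.0870087,
    "train/loss_action": 0.00918485,
    "train/loss_video": 0.0778239,
    "_runtime": 32738.4,
    "_step": 20000,
}

print(check_loss_decomposition(summary, "val"))    # True
print(check_loss_decomposition(summary, "train"))  # True
print(round(seconds_per_step(summary), 2))         # 1.64
```

The runtime corresponds to roughly 9.1 hours. It covers the whole run, so the per-step figure includes validation and checkpointing time, not only optimizer steps.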

Saved Validation Losses

Only step-20000.safetensors is uploaded here. Earlier checkpoints were saved locally every 2500 steps; their validation losses are listed below for provenance.

Best saved checkpoint by aggregate validation loss:

step 15000: loss=0.196729, loss_video=0.081665, loss_action=0.115064

Step	Loss	Video Loss	Action Loss
2500	0.288992	0.140088	0.148904
5000	0.196988	0.077994	0.118994
7500	0.250991	0.099323	0.151668
10000	0.330028	0.118547	0.211480
12500	0.321415	0.121587	0.199828
15000	0.196729	0.081665	0.115064
17500	0.333581	0.099311	0.234270
20000	0.202230	0.090690	0.111540
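The "best saved checkpoint" line above can be reproduced from the table, and the same lookup shows which step is best on each loss component. This is a small sketch over the table values only; ValRow and best_by are names introduced here for illustration.

```python
from typing import List, NamedTuple


class ValRow(NamedTuple):
    step: int
    loss: float
    loss_video: float
    loss_action: float


# Rows copied from the validation table above.
ROWS: List[ValRow] = [
    ValRow(2500, 0.288992, 0.140088, 0.148904),
    ValRow(5000, 0.196988, 0.077994, 0.118994),
    ValRow(7500, 0.250991, 0.099323, 0.151668),
    ValRow(10000, 0.330028, 0.118547, 0.211480),
    ValRow(12500, 0.321415, 0.121587, 0.199828),
    ValRow(15000, 0.196729, 0.081665, 0.115064),
    ValRow(17500, 0.333581, 0.099311, 0.234270),
    ValRow(20000, 0.202230, 0.090690, 0.111540),
]


def best_by(rows: List[ValRow], field: str) -> ValRow:
    """Return the row with the smallest value in the named column."""
    return min(rows, key=lambda row: getattr(row, field))


print(best_by(ROWS, "loss").step)         # 15000
print(best_by(ROWS, "loss_video").step)   # 5000
print(best_by(ROWS, "loss_action").step)  # 20000
```

By these numbers, the uploaded step-20000 checkpoint has the lowest validation action loss of the saved checkpoints, while step 15000 has the lowest aggregate loss and step 5000 the lowest video loss.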

Training Configuration

Key	Value
`dataset_type`	`wuji_pick_place`
`wuji_robot_dataset_root`	`None`
`variant`	`clean_50`
`height`	`384`
`width`	`320`
`num_frames`	`33`
`action_video_freq_ratio`	`4`
`action_horizon`	`None`
`action_dim`	`54`
`proprio_dim`	`54`
`action_format`	`absolute`
`action_space`	`joint`
`proprio_space`	`action`
`action_pad_mode`	`last`
`action_expert_style`	`openwam`
`action_mot_backbone_pretrained_path`	`/cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/pretrained/ActionMoT_openwam_linear_interp_Wan22_alphascale_1024hdim.pt`
`mask_variant`	`v3`
`mask_tail_padding_loss`	`True`
`full_reference_video`	`True`
`max_ref_frames`	`57`
`reference_dropout`	`0.1`
`bridge_exclude_full_ref`	`True`
`extra_inputs`	`vace_reference_image,action_trajectory`
`target_camera`	`head_camera`
`reference_camera`	`head_camera`
`resize_mode`	`stretch`
`backbone`	`ti2v`
`model_paths`	`["models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00001-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00002-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00003-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth","models/Wan-AI/Wan2.2-TI2V-5B/Wan2.2_VAE.pth"]`
`tokenizer_path`	`models/Wan-AI/Wan2.2-TI2V-5B/google/umt5-xxl`
`trainable_models`	`dit`
`learning_rate`	`5e-05`
`action_lr`	`None`
`weight_decay`	`0.01`
`warmup_steps`	`500`
`max_steps`	`20000`
`num_epochs`	`1`
`batch_size`	`1`
`gradient_accumulation_steps`	`1`
`dataset_repeat`	`1`
`dataset_num_workers`	`8`
`use_gradient_checkpointing`	`True`
`save_steps`	`2500`
`val_steps`	`500`
`video_log_steps`	`2500`
`max_val_samples`	`20`
`lambda_video`	`1`
`lambda_action`	`1`
`video_dim`	`3072`
`action_dit_dim`	`1024`
`action_dit_ffn_dim`	`4096`
`action_dit_num_heads`	`24`
`action_dit_num_layers`	`30`
`proprio_dropout`	`0.1`
`window_stride`	`1`
`val_ratio`	`0.1`
`output_path`	`/cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_mask_v3`
`wandb_project`	`wuji_pick_place`
`wandb_run_name`	`wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_mask_v3_0526_2127`

Input/Output Contract

Expected inputs:

prompt: Pick up the ball with the left hand and place it in the basket.
target camera: head_camera
reference camera: head_camera
target video frames: 33
full reference frames: 57
image resolution: 384 x 320 (height x width)
action/proprio dim: 54 / 54

Expected outputs:

robot-view target video rollout
54-D absolute robot action targets
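The contract above can be turned into a quick shape check before running inference. The sketch below is an assumption-heavy illustration: it assumes videos are laid out as (frames, height, width, channels), that the proprioceptive state has 54 as its last dimension, and that 57 is an upper bound on reference frames (from max_ref_frames). The function name contract_violations is hypothetical, and the real pipeline may use a different tensor layout.

```python
from typing import List, Sequence

NUM_TARGET_FRAMES = 33
MAX_REF_FRAMES = 57
HEIGHT, WIDTH = 384, 320
PROPRIO_DIM = 54


def contract_violations(reference_shape: Sequence[int],
                        proprio_shape: Sequence[int],
                        rollout_shape: Sequence[int]) -> List[str]:
    """Return human-readable problems; an empty list means the shapes fit."""
    ref, prop, out = tuple(reference_shape), tuple(proprio_shape), tuple(rollout_shape)
    problems: List[str] = []
    # Assumed video layout: (frames, height, width, channels).
    if len(ref) != 4 or not 1 <= ref[0] <= MAX_REF_FRAMES:
        problems.append(f"reference video needs 1..{MAX_REF_FRAMES} frames, got {ref}")
    elif ref[1:3] != (HEIGHT, WIDTH):
        problems.append(f"reference frames must be {HEIGHT}x{WIDTH} (HxW), got {ref[1:3]}")
    if not prop or prop[-1] != PROPRIO_DIM:
        problems.append(f"proprio last dim must be {PROPRIO_DIM}, got {prop}")
    if len(out) != 4 or out[0] != NUM_TARGET_FRAMES or out[1:3] != (HEIGHT, WIDTH):
        problems.append(
            f"rollout must be ({NUM_TARGET_FRAMES}, {HEIGHT}, {WIDTH}, C), got {out}")
    return problems


print(contract_violations((57, 384, 320, 3), (54,), (33, 384, 320, 3)))  # []
```

A common mistake this would catch is swapping height and width, since the frames here are taller than they are wide.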

Masking Note

This run uses mask_variant=v3 with full cross-embodiment reference video conditioning and records bridge_exclude_full_ref=True.

URL: https://huggingface.co/knightnemo/wuji-pick-place-vam-ti2v5b-30l-robotwin-ref-openwam-16g-mask-v3