Wuji Hand Gesture VAM TI2V-5B 30L OpenWAM V2
This repository contains the newer Wuji hand gesture VAM checkpoint trained on May 20, 2026.
Important identity:
action expert style: OpenWAM
masking variant: v2 masking
not: v3 masking
The checkpoint was trained with the Wan2.2-TI2V-5B video backbone plus a 30-layer ActionMoT stream. It is a joint checkpoint: the .safetensors file contains both fine-tuned video DiT weights and action_dit.* action-stream weights.
Given a human reference gesture video, a current robot-view anchor frame, a text prompt, and the current robot proprioceptive state, the model jointly predicts:
- a short robot-view target video rollout, and
- a sequence of 20-D robot action targets.
Files
Expected model files in this repo:
step-10000.safetensors # final checkpoint, trained for 10k steps
model_config.json # architecture and preprocessing contract
training_config.yaml # W&B training config snapshot
training_log_node0.txt # training log snapshot
README.md # this model card
Checkpoint SHA256:
a27d4c977c9fb1978bf40f313143c3f185edad794f7458a67211a43da5f16105
Checkpoint size:
12041754617 bytes
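A download can be checked against the digest and size above before loading. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of this repository.

```python
import hashlib
from pathlib import Path

# Values copied from this model card.
EXPECTED_SHA256 = "a27d4c977c9fb1978bf40f313143c3f185edad794f7458a67211a43da5f16105"
EXPECTED_SIZE = 12041754617  # bytes


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so the checkpoint is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected_sha256=EXPECTED_SHA256, expected_size=EXPECTED_SIZE):
    """Return True only if both the byte size and the SHA256 digest match."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False  # cheap check first; avoids hashing a truncated download
    return sha256_of(path) == expected_sha256
```

Calling `verify_checkpoint("step-10000.safetensors")` from the directory holding the checkpoint should return `True` for an intact file.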
The Wan2.2-TI2V-5B base model is not included in this repo. Provide it separately under:
models/Wan-AI/Wan2.2-TI2V-5B/
with these files:
diffusion_pytorch_model-00001-of-00003.safetensors
diffusion_pytorch_model-00002-of-00003.safetensors
diffusion_pytorch_model-00003-of-00003.safetensors
models_t5_umt5-xxl-enc-bf16.pth
Wan2.2_VAE.pth
google/umt5-xxl/
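Before launching inference, it can help to confirm that the base model layout above is complete. This is a small sketch, not a script shipped with the repo; `missing_base_model_entries` is a hypothetical helper name, and the default root is the path given above.

```python
from pathlib import Path

# Entries listed in this model card; the last one is a directory.
REQUIRED_ENTRIES = [
    "diffusion_pytorch_model-00001-of-00003.safetensors",
    "diffusion_pytorch_model-00002-of-00003.safetensors",
    "diffusion_pytorch_model-00003-of-00003.safetensors",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2_VAE.pth",
    "google/umt5-xxl",
]


def missing_base_model_entries(root="models/Wan-AI/Wan2.2-TI2V-5B"):
    """Return the required files or directories that do not exist under root."""
    root = Path(root)
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]
```

An empty list means every listed entry is present; the check does not validate file contents.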
Training Setup
Final run:
run directory: src/vam/models/train/wuji_hand_gesture_vam_ti2v5b_30L_openwam_fastwam_warmstart
checkpoint: step-10000.safetensors
wandb run name: wuji_hand_gesture_vam_ti2v5b_30L_mask_v2_openwam_0520_1708
wandb run: https://wandb.ai/wuji-tech/wuji_hand_gesture/runs/toehxouy
max steps: 10000
backbone: Wan2.2-TI2V-5B
trainable: video DiT + ActionMoT
action expert: OpenWAM
mask variant: v2
full ref flag: bridge_exclude_full_ref=true in the run config
action_dim: 20
proprio_dim: 20
resolution: 256 x 256
The ActionMoT stream was initialized from:
src/vam/models/pretrained/ActionMoT_openwam_linear_interp_Wan22_alphascale_1024hdim.pt
Final validation sample logged during training:
val loss: 0.569243
loss_video: 0.090544
loss_action: 0.478699
action_MSE: 0.025611
action_MAE: 0.110880
video_MSE: 57.30
PSNR: 31.48
SSIM: 0.9654
LPIPS: 0.0169
Training dataset summary from the log:
train episodes: 227
val episodes: 25
train labels: eight, four, one, seven, six, ten, three, two
val labels: eight, one, seven, six, ten, three
The validation split does not cover every training label: it contains no episodes for four or two, so the validation metrics above do not reflect those two gestures.
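The coverage gap can be reproduced from the two label lists with a set difference. The snippet below only restates the labels from the log summary above.

```python
# Labels copied from the training log summary in this model card.
train_labels = {"eight", "four", "one", "seven", "six", "ten", "three", "two"}
val_labels = {"eight", "one", "seven", "six", "ten", "three"}

# Labels seen in training but never evaluated during validation.
missing_from_val = sorted(train_labels - val_labels)
print(missing_from_val)  # ['four', 'two']
```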
Masking Strategy
The active joint self-attention sequence layout is:
[ref_video | first_frame | gen_video | action]
This checkpoint uses mask_variant=v2. Its attention visibility is:
ref_video -> ref_video
first_frame -> ref_video + first_frame
gen_video -> ref_video + first_frame + gen_video
action -> ref_video + first_frame + gen_video + action
So v2 prevents the generated video tokens from attending to action tokens. It does not prevent action tokens from attending to generated video tokens.
Do not treat this checkpoint as v3 masking. In v3, action would not attend to generated video tokens. That is not the mask used here.
The run config includes bridge_exclude_full_ref=true. For this OpenWAM ActionMoT path, the operative mask is the mask_variant=v2 joint self-attention described above. The bridge flag is recorded for provenance and legacy bridge compatibility; it is not what distinguishes v2 from v3.
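The v2 visibility rules above can be written out as a block-wise boolean mask. The sketch below is illustrative rather than the training code: the per-segment token counts are made up, and it assumes full attention within each visible segment (no causal structure inside a segment), which this model card does not specify.

```python
# Segment order of the joint self-attention sequence, as described above.
SEGMENT_ORDER = ["ref_video", "first_frame", "gen_video", "action"]

# v2 visibility: which key segments each query segment may attend to.
V2_VISIBLE = {
    "ref_video": {"ref_video"},
    "first_frame": {"ref_video", "first_frame"},
    "gen_video": {"ref_video", "first_frame", "gen_video"},
    "action": {"ref_video", "first_frame", "gen_video", "action"},
}


def build_v2_mask(segment_lengths):
    """Return mask[q][k]; True means query token q may attend to key token k."""
    # Tag every token position with the name of the segment it belongs to.
    token_segments = []
    for name in SEGMENT_ORDER:
        token_segments.extend([name] * segment_lengths[name])
    return [
        [key_seg in V2_VISIBLE[query_seg] for key_seg in token_segments]
        for query_seg in token_segments
    ]


# Hypothetical token counts, chosen only to keep the example small.
lengths = {"ref_video": 2, "first_frame": 1, "gen_video": 2, "action": 3}
mask = build_v2_mask(lengths)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In the printed grid, the gen_video rows have no entries in the action columns, while the action rows are fully populated, which matches the v2 description.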
Important Shape Contract
Do not confuse the raw action window length with the video frame count.
Training used:
raw action window length: 49 frames
action horizon: 48 actions
action_video_freq_ratio: 4
target video frames: 13 frames, sampled at raw indices [0, 4, 8, ..., 48]
full reference frames: 65 frames
image resolution: 256 x 256 RGB
Direct inference should therefore normally use:
num_frames = 13
action_horizon = 48
height = width = 256
action_dim = proprio_dim = 20
If you pass num_frames=49 to inference, you are no longer matching the training distribution. The model saw 13 target video frames per 48-action rollout.
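The relationship between the numbers above can be checked with a few lines of arithmetic. This sketch assumes the action horizon is the raw window length minus one and that video frames are taken every action_video_freq_ratio raw steps, which is consistent with the values listed but is not taken from the training code.

```python
RAW_WINDOW = 49              # raw action window length, in frames
ACTION_VIDEO_FREQ_RATIO = 4  # raw steps per sampled video frame


def shape_contract(raw_window=RAW_WINDOW, ratio=ACTION_VIDEO_FREQ_RATIO):
    """Derive action_horizon, num_frames and sampled raw indices."""
    action_horizon = raw_window - 1  # assumption: one action per frame transition
    if action_horizon % ratio != 0:
        raise ValueError("action horizon must be a multiple of the frequency ratio")
    frame_indices = list(range(0, raw_window, ratio))
    return {
        "action_horizon": action_horizon,
        "num_frames": len(frame_indices),
        "frame_indices": frame_indices,
    }


contract = shape_contract()
print(contract["action_horizon"], contract["num_frames"])  # 48 13
print(contract["frame_indices"][:3], contract["frame_indices"][-1])  # [0, 4, 8] 48
```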
Expected Inputs
There are two supported ways to feed this model.
1. LeRobot Dataset Sample
The training dataloader expects a LeRobot v2-style dataset directory, for example:
wuji-hand-gestures-cropped/
meta/
info.json
tasks.jsonl
episodes.jsonl
stats.json
data/
chunk-000/
episode_000000.parquet
...
videos/
chunk-000/
observation.images.robot_view/
episode_000000.mp4
observation.images.human_view/
episode_000000.mp4
Required features:
observation.images.robot_view robot-view target video stream
observation.images.human_view paired human reference video stream
action 20-D robot action vector
task_index integer key into meta/tasks.jsonl
meta/stats.json must contain action normalization stats:
{
"action": {
"mean": [20 floats],
"std": [20 floats]
}
}
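A quick sanity check of meta/stats.json before training or inference might look like the sketch below. `load_action_stats` is a hypothetical helper, and the requirement that every std entry be strictly positive is an added guard against division by zero during normalization, not something stated by the dataloader.

```python
import json
import math
from pathlib import Path


def load_action_stats(dataset_root, action_dim=20):
    """Load action mean/std from meta/stats.json and validate their shape."""
    stats_path = Path(dataset_root) / "meta" / "stats.json"
    with open(stats_path, "r", encoding="utf-8") as handle:
        action = json.load(handle)["action"]
    mean, std = action["mean"], action["std"]
    for name, values in (("mean", mean), ("std", std)):
        if len(values) != action_dim:
            raise ValueError(f"action.{name} has {len(values)} entries, expected {action_dim}")
        if not all(math.isfinite(v) for v in values):
            raise ValueError(f"action.{name} contains a non-finite value")
    # Assumption: std is used as a divisor, so zero or negative entries are rejected.
    if any(v <= 0 for v in std):
        raise ValueError("action.std must be strictly positive")
    return mean, std
```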
Each task string is canonicalized into a prompt:
the robot performs hand gesture {label}
Examples:
the robot performs hand gesture one
the robot performs hand gesture two
the robot performs hand gesture thumbs_up
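The template can be applied with a one-line formatter. This sketch covers only the final formatting step; how the dataloader extracts the label from a raw task string is not described here, so `build_prompt` (a hypothetical name) assumes the label is already available.

```python
PROMPT_TEMPLATE = "the robot performs hand gesture {label}"


def build_prompt(label):
    """Format a gesture label into the prompt template used in training."""
    label = label.strip()
    if not label:
        raise ValueError("label must be a non-empty string")
    return PROMPT_TEMPLATE.format(label=label)


print(build_prompt("one"))        # the robot performs hand gesture one
print(build_prompt("thumbs_up"))  # the robot performs hand gesture thumbs_up
```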
The dataloader builds one sample like this:
{
"video": list[PIL.Image], # 13 robot-view RGB frames, 256x256
"vace_reference_image": [PIL.Image], # first robot frame, 256x256
"full_reference_video": list[PIL.Image], # 65 human-view RGB frames, 256x256
"action_trajectory": torch.Tensor, # shape (48, 20), normalized
"action_mask": torch.BoolTensor, # shape (48,)
"proprio": torch.Tensor, # shape (20,), normalized current qpos/action
"prompt": str,
}
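The sample layout above can be checked with a small validator. This is a sketch under the assumption that tensor-like values expose a `.shape` attribute (as torch.Tensor does); `check_sample` is a hypothetical helper and is not part of the dataloader.

```python
def check_sample(sample, num_frames=13, ref_frames=65, action_horizon=48, action_dim=20):
    """Return a list of human-readable problems; an empty list means the sample fits."""
    problems = []
    if len(sample["video"]) != num_frames:
        problems.append(f"video has {len(sample['video'])} frames, expected {num_frames}")
    if len(sample["vace_reference_image"]) != 1:
        problems.append("vace_reference_image must be a one-frame list")
    if len(sample["full_reference_video"]) != ref_frames:
        problems.append(f"full_reference_video must have {ref_frames} frames")
    # Shapes are compared as plain tuples so any tensor-like object works.
    expected_shapes = {
        "action_trajectory": (action_horizon, action_dim),
        "action_mask": (action_horizon,),
        "proprio": (action_dim,),
    }
    for key, expected in expected_shapes.items():
        actual = tuple(sample[key].shape)
        if actual != expected:
            problems.append(f"{key} has shape {actual}, expected {expected}")
    if not isinstance(sample["prompt"], str):
        problems.append("prompt must be a string")
    return problems
```

The validator checks lengths and shapes only; it does not verify image resolution or that the values are normalized.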
2. Direct Python Inputs
For direct inference through examples/wanvideo/human2robot/joint_inference.py, prepare:
prompt:
Must follow the training template:
"the robot performs hand gesture {label}"
vace_reference_image:
A one-frame list containing the current robot-view anchor frame.
Shape after preprocessing: RGB, 256x256.
full_reference_video:
The human demonstration/reference video.
Expected length: 65 RGB frames.
Each frame should be resized/padded to 256x256.
If your source video has fewer than 65 frames, pad with neutral gray frames.
If it has more than 65 frames, uniformly sample 65 frames.
Length must satisfy 4n+1. 65 is the training value.
proprio:
Current robot proprioceptive state, normalized.
Shape can be (20,), (1, 20), or (1, 1, 20).
Normalization is:
proprio_normed = (proprio_raw - action_mean) / action_std
The checkpoint stores action_mean/action_std in action_dit buffers.
num_frames:
Use 13.
action_horizon:
Use 48.
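The two preprocessing steps above, fitting the reference video to 65 frames and normalizing proprio, can be sketched in plain Python. The helper names are hypothetical. Padding at the end of the clip and the floor-based uniform index scheme are assumptions, since the rules above do not say where padding goes or how indices are rounded. Frames are treated as opaque objects, so the caller supplies the gray pad frame (for example a 256x256 RGB image).

```python
def fit_reference_length(frames, pad_frame, target_len=65):
    """Pad or uniformly subsample a frame list to exactly target_len frames."""
    if target_len % 4 != 1:
        raise ValueError("target_len must satisfy 4n+1")
    if not frames:
        raise ValueError("frames must not be empty")
    n = len(frames)
    if n <= target_len:
        # Assumption: padding frames are appended after the real frames.
        return list(frames) + [pad_frame] * (target_len - n)
    # Evenly spaced indices that always include the first and last frame.
    indices = [(i * (n - 1)) // (target_len - 1) for i in range(target_len)]
    return [frames[i] for i in indices]


def normalize_proprio(proprio_raw, action_mean, action_std, dim=20):
    """Apply (proprio_raw - action_mean) / action_std element-wise."""
    if not (len(proprio_raw) == len(action_mean) == len(action_std) == dim):
        raise ValueError(f"proprio, mean and std must all have {dim} entries")
    return [(r - m) / s for r, m, s in zip(proprio_raw, action_mean, action_std)]


# Usage with stand-in values: integers play the role of frames here.
short_clip = fit_reference_length(list(range(10)), pad_frame="gray")
long_clip = fit_reference_length(list(range(130)), pad_frame="gray")
print(len(short_clip), len(long_clip))  # 65 65
```

The statistics passed to `normalize_proprio` should be the same action mean and std used in training, such as those in meta/stats.json or the action_dit buffers mentioned above.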
Provenance
Local training checkpoint:
src/vam/models/train/wuji_hand_gesture_vam_ti2v5b_30L_openwam_fastwam_warmstart/step-10000.safetensors
The run config records:
action_expert_style: openwam
mask_variant: v2
bridge_exclude_full_ref: true
full_reference_video: true
Model tree for knightnemo/wuji-hand-gesture-vam-ti2v5b-30l-openwam-v2
Base model: Wan-AI/Wan2.2-TI2V-5B