URL: https://huggingface.co/knightnemo/wuji-hand-gesture-vam-ti2v5b-30l-openwam-v2


Wuji Hand Gesture VAM TI2V-5B 30L OpenWAM V2

This repository contains the newer Wuji hand gesture VAM checkpoint trained on May 20, 2026.

Important identity:

action expert style: OpenWAM
masking variant: v2 masking
not: v3 masking

The checkpoint was trained with the Wan2.2-TI2V-5B video backbone plus a 30-layer ActionMoT stream. It is a joint checkpoint: the .safetensors file contains both fine-tuned video DiT weights and action_dit.* action-stream weights.
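Because the file is a joint checkpoint, it can help to separate the two weight groups by key prefix before loading them into different modules. The sketch below is illustrative and is not taken from the training repository: `split_joint_keys` is a hypothetical helper, and it assumes every action-stream tensor name starts with `action_dit.` as described above.

```python
from typing import Iterable

ACTION_PREFIX = "action_dit."  # prefix named in this model card


def split_joint_keys(keys: Iterable[str]) -> tuple[list[str], list[str]]:
    """Partition tensor names into (video_dit_keys, action_keys) by prefix."""
    video_keys: list[str] = []
    action_keys: list[str] = []
    for key in keys:
        if key.startswith(ACTION_PREFIX):
            action_keys.append(key)
        else:
            video_keys.append(key)
    return video_keys, action_keys


# Made-up tensor names for illustration; real names come from the checkpoint.
video, action = split_joint_keys(
    ["blocks.0.attn.q.weight", "action_dit.blocks.0.attn.q.weight"]
)
print(len(video), len(action))
```

With the `safetensors` package installed, the real key names can be listed without loading any tensors via `safe_open(path, framework="pt")` and its `keys()` method.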

Given a human reference gesture video, a current robot-view anchor frame, a text prompt, and the current robot proprioceptive state, the model jointly predicts:

  • a short robot-view target video rollout, and
  • a sequence of 20-D robot action targets.

Files

Expected model files in this repo:

step-10000.safetensors # final checkpoint, trained for 10k steps
model_config.json # architecture and preprocessing contract
training_config.yaml # W&B training config snapshot
training_log_node0.txt # training log snapshot
README.md # this model card

Checkpoint SHA256:

a27d4c977c9fb1978bf40f313143c3f185edad794f7458a67211a43da5f16105

Checkpoint size:

12041754617 bytes
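A download can be checked against the size and digest above before loading. This is a minimal sketch using only the Python standard library; `file_sha256` and `verify_checkpoint` are hypothetical helper names, not part of this repository.

```python
import hashlib
from pathlib import Path

EXPECTED_SHA256 = "a27d4c977c9fb1978bf40f313143c3f185edad794f7458a67211a43da5f16105"
EXPECTED_SIZE = 12041754617  # bytes, from this model card


def file_sha256(path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so the ~12 GB checkpoint never sits in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected_sha256: str = EXPECTED_SHA256,
                      expected_size: int = EXPECTED_SIZE) -> bool:
    """Return True only if both the byte size and the SHA-256 digest match."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False  # cheap check first; avoids hashing a truncated download
    return file_sha256(path) == expected_sha256
```

Usage: `verify_checkpoint("step-10000.safetensors")` should return `True` for an intact download.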

The Wan2.2-TI2V-5B base model is not included. You need it separately under:

models/Wan-AI/Wan2.2-TI2V-5B/

with these files:

diffusion_pytorch_model-00001-of-00003.safetensors
diffusion_pytorch_model-00002-of-00003.safetensors
diffusion_pytorch_model-00003-of-00003.safetensors
models_t5_umt5-xxl-enc-bf16.pth
Wan2.2_VAE.pth
google/umt5-xxl/
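Before running inference, the base model layout listed above can be checked for missing pieces. The sketch below only tests that each path exists; `missing_base_files` is a hypothetical helper, and the default directory is the one named in this card.

```python
from pathlib import Path

# Entries copied from the list above; the last one is a directory.
REQUIRED_BASE_FILES = [
    "diffusion_pytorch_model-00001-of-00003.safetensors",
    "diffusion_pytorch_model-00002-of-00003.safetensors",
    "diffusion_pytorch_model-00003-of-00003.safetensors",
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2_VAE.pth",
    "google/umt5-xxl",
]


def missing_base_files(base_dir="models/Wan-AI/Wan2.2-TI2V-5B") -> list[str]:
    """Return the required entries that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in REQUIRED_BASE_FILES if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_base_files()
    print("base model OK" if not missing else f"missing: {missing}")
```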

Training Setup

Final run:

run directory: src/vam/models/train/wuji_hand_gesture_vam_ti2v5b_30L_openwam_fastwam_warmstart
checkpoint: step-10000.safetensors
wandb run name: wuji_hand_gesture_vam_ti2v5b_30L_mask_v2_openwam_0520_1708
wandb run: https://wandb.ai/wuji-tech/wuji_hand_gesture/runs/toehxouy
max steps: 10000
backbone: Wan2.2-TI2V-5B
trainable: video DiT + ActionMoT
action expert: OpenWAM
mask variant: v2
full ref flag: bridge_exclude_full_ref=true in the run config
action_dim: 20
proprio_dim: 20
resolution: 256 x 256

The ActionMoT stream was initialized from:

src/vam/models/pretrained/ActionMoT_openwam_linear_interp_Wan22_alphascale_1024hdim.pt

Final validation sample logged during training:

val loss: 0.569243
loss_video: 0.090544
loss_action: 0.478699

action_MSE: 0.025611
action_MAE: 0.110880
video_MSE: 57.30
PSNR: 31.48
SSIM: 0.9654
LPIPS: 0.0169

Training dataset summary from the log:

train episodes: 227
val episodes: 25
train labels: eight, four, one, seven, six, ten, three, two
val labels: eight, one, seven, six, ten, three

The validation split is not label-balanced: it contains no episodes for four or two, so the validation metrics above do not cover those two gestures.

Masking Strategy

The active joint self-attention sequence layout is:

[ref_video | first_frame | gen_video | action]

This checkpoint uses mask_variant=v2. Its attention visibility is:

ref_video -> ref_video
first_frame -> ref_video + first_frame
gen_video -> ref_video + first_frame + gen_video
action -> ref_video + first_frame + gen_video + action

So v2 prevents the generated video tokens from attending to action tokens. It does not prevent action tokens from attending to generated video tokens.

Do not treat this checkpoint as v3 masking. In v3, action would not attend to generated video tokens. That is not the mask used here.

The run config includes bridge_exclude_full_ref=true. For this OpenWAM/ActionMoT MoT path, the operative mask is the mask_variant=v2 joint self-attention above; the bridge flag is recorded for provenance and legacy bridge compatibility, but it is not the v2/v3 distinction.
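The v2 visibility rules above can be written out as a block attention mask. The following is a minimal pure-Python sketch, not the training code: `build_v2_mask` is a hypothetical name, the per-segment token counts are toy values, and real token counts depend on the model's patchification.

```python
SEGMENTS = ["ref_video", "first_frame", "gen_video", "action"]

# Key segments that each query segment may attend to under mask_variant=v2.
V2_VISIBILITY = {
    "ref_video": {"ref_video"},
    "first_frame": {"ref_video", "first_frame"},
    "gen_video": {"ref_video", "first_frame", "gen_video"},
    "action": {"ref_video", "first_frame", "gen_video", "action"},
}


def build_v2_mask(lengths: dict[str, int]) -> list[list[bool]]:
    """Return mask[q][k], True where query token q may attend to key token k."""
    # Label every token position with the segment it belongs to, in layout order.
    token_segment = [name for name in SEGMENTS for _ in range(lengths[name])]
    return [
        [k_seg in V2_VISIBILITY[q_seg] for k_seg in token_segment]
        for q_seg in token_segment
    ]


# Toy lengths for illustration only.
mask = build_v2_mask({"ref_video": 2, "first_frame": 1, "gen_video": 2, "action": 3})
for row in mask:
    print("".join("1" if visible else "." for visible in row))
```

In the printed grid, the rows for gen_video tokens have no entries in the action columns, while the rows for action tokens are fully populated, which is the v2 asymmetry described above.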

Important Shape Contract

Do not confuse the raw action window length with the video frame count.

Training used:

raw action window length: 49 frames
action horizon: 48 actions
action_video_freq_ratio: 4
target video frames: 13 frames, sampled at raw indices [0, 4, 8, ..., 48]
full reference frames: 65 frames
image resolution: 256 x 256 RGB

So direct inference should normally use these settings:

num_frames = 13
action_horizon = 48
height = width = 256
action_dim = proprio_dim = 20

If you pass num_frames=49 to inference, you are no longer matching the training distribution. The model saw 13 target video frames per 48-action rollout.
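The relationship between the action horizon and the video frame count can be checked with a few lines of arithmetic. This sketch assumes the raw window is always one frame longer than the action horizon, which matches the 49/48 values above but is otherwise an inference; `target_frame_indices` is a hypothetical helper.

```python
def target_frame_indices(action_horizon: int = 48, freq_ratio: int = 4) -> list[int]:
    """Raw-window indices of the target video frames for one rollout."""
    if action_horizon % freq_ratio != 0:
        raise ValueError("action_horizon must be a multiple of action_video_freq_ratio")
    raw_window = action_horizon + 1  # 49 raw frames for 48 actions (assumed relation)
    return list(range(0, raw_window, freq_ratio))


indices = target_frame_indices()
num_frames = len(indices)
print(num_frames, indices[-1])  # prints "13 48"
```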

Expected Inputs

There are two supported ways to feed this model.

1. LeRobot Dataset Sample

The training dataloader expects a LeRobot v2-style dataset directory, for example:

wuji-hand-gestures-cropped/
  meta/
    info.json
    tasks.jsonl
    episodes.jsonl
    stats.json
  data/
    chunk-000/
      episode_000000.parquet
      ...
  videos/
    chunk-000/
      observation.images.robot_view/
        episode_000000.mp4
      observation.images.human_view/
        episode_000000.mp4

Required features:

observation.images.robot_view    robot-view target video stream
observation.images.human_view    paired human reference video stream
action                           20-D robot action vector
task_index                       integer key into meta/tasks.jsonl

meta/stats.json must contain action normalization stats:

{
  "action": {
    "mean": [20 floats],
    "std": [20 floats]
  }
}
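A dataset's stats file can be checked against this contract before training or inference. The sketch below uses only the standard library; `load_action_stats` is a hypothetical helper, and the requirement that every std entry be strictly positive is an assumption added here to avoid division by zero during normalization, not something this card states.

```python
import json
from pathlib import Path

ACTION_DIM = 20  # action_dim from this model card


def load_action_stats(stats_path) -> tuple[list[float], list[float]]:
    """Load (mean, std) for the action feature and validate their lengths."""
    stats = json.loads(Path(stats_path).read_text())
    action = stats["action"]
    mean, std = action["mean"], action["std"]
    for name, values in (("mean", mean), ("std", std)):
        if len(values) != ACTION_DIM:
            raise ValueError(f"action.{name} has {len(values)} entries, expected {ACTION_DIM}")
    # Assumption: a zero or negative std would break (x - mean) / std.
    if any(s <= 0 for s in std):
        raise ValueError("action.std must be strictly positive")
    return [float(m) for m in mean], [float(s) for s in std]
```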

Each task string is canonicalized into a prompt:

the robot performs hand gesture {label}

Examples:

the robot performs hand gesture one
the robot performs hand gesture two
the robot performs hand gesture thumbs_up
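The template above can be applied with a small helper. This is a sketch: `make_prompt` is a hypothetical name, and it only strips surrounding whitespace, since this card does not say whether the training code applies any other normalization to the label.

```python
PROMPT_TEMPLATE = "the robot performs hand gesture {label}"


def make_prompt(label: str) -> str:
    """Fill the training prompt template with a gesture label."""
    label = label.strip()
    if not label:
        raise ValueError("label must be a non-empty string")
    return PROMPT_TEMPLATE.format(label=label)


print(make_prompt("one"))  # prints "the robot performs hand gesture one"
```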

The dataloader builds one sample like this:

{
 "video": list[PIL.Image], # 13 robot-view RGB frames, 256x256
 "vace_reference_image": [PIL.Image], # first robot frame, 256x256
 "full_reference_video": list[PIL.Image], # 65 human-view RGB frames, 256x256
 "action_trajectory": torch.Tensor, # shape (48, 20), normalized
 "action_mask": torch.BoolTensor, # shape (48,)
 "proprio": torch.Tensor, # shape (20,), normalized current qpos/action
 "prompt": str,
}

2. Direct Python Inputs

For direct inference through examples/wanvideo/human2robot/joint_inference.py, prepare:

prompt:
 Must follow the training template:
 "the robot performs hand gesture {label}"

vace_reference_image:
 A one-frame list containing the current robot-view anchor frame.
 Shape after preprocessing: RGB, 256x256.

full_reference_video:
 The human demonstration/reference video.
 Expected length: 65 RGB frames.
 Each frame should be resized/padded to 256x256.
 If your source video has fewer than 65 frames, pad with neutral gray frames.
 If it has more than 65 frames, uniformly sample 65 frames.
 The length must be of the form 4n+1; 65 (n=16) is the training value.

proprio:
 Current robot proprioceptive state, normalized.
 Shape can be (20,), (1, 20), or (1, 1, 20).
 Normalization is:
 proprio_normed = (proprio_raw - action_mean) / action_std
 The checkpoint stores action_mean/action_std in action_dit buffers.

num_frames:
 Use 13.

action_horizon:
 Use 48.
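Two of the preparation steps above, fitting the reference video to 65 frames and normalizing proprio, can be sketched without any imaging or tensor library. Both helpers are hypothetical. Padding at the end of the clip and the rounding rule used for uniform sampling are assumptions, since this card does not specify them. A `None` entry in the plan stands for a neutral gray pad frame.

```python
from typing import Optional, Sequence

REF_FRAMES = 65  # training value for full_reference_video


def reference_frame_plan(num_source_frames: int,
                         target: int = REF_FRAMES) -> list[Optional[int]]:
    """Return `target` entries: a source frame index, or None for a gray pad frame."""
    if target % 4 != 1:
        raise ValueError("target length must be of the form 4n+1")
    if num_source_frames <= 0:
        raise ValueError("source video must contain at least one frame")
    if num_source_frames <= target:
        # Keep every frame, then pad the tail (tail padding is an assumption).
        return list(range(num_source_frames)) + [None] * (target - num_source_frames)
    # Uniform sampling that includes the first and last source frame.
    step = (num_source_frames - 1) / (target - 1)
    return [round(i * step) for i in range(target)]


def normalize_proprio(raw: Sequence[float], mean: Sequence[float],
                      std: Sequence[float]) -> list[float]:
    """Apply proprio_normed = (proprio_raw - action_mean) / action_std elementwise."""
    if not (len(raw) == len(mean) == len(std)):
        raise ValueError("raw, mean and std must have the same length")
    return [(r - m) / s for r, m, s in zip(raw, mean, std)]
```

For example, `reference_frame_plan(40)` keeps all 40 source frames and appends 25 pad entries, while `reference_frame_plan(130)` selects 65 distinct indices from 0 to 129.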

Provenance

Local training checkpoint:

src/vam/models/train/wuji_hand_gesture_vam_ti2v5b_30L_openwam_fastwam_warmstart/step-10000.safetensors

The run config records:

action_expert_style: openwam
mask_variant: v2
bridge_exclude_full_ref: true
full_reference_video: true