Teaser Video
Building Behavioral Foundation Models (BFMs) for humanoid robots has the potential to unify diverse control tasks under a single, promptable generalist policy. However, existing approaches are either deployed exclusively on simulated humanoid characters or specialized to specific tasks such as tracking. We propose BFM-Zero, a framework that learns an effective shared latent representation that embeds motions, goals, and rewards into a common space, enabling a single policy to be prompted for multiple downstream tasks without retraining. This well-structured latent space enables versatile and robust whole-body skills on a Unitree G1 humanoid in the real world via diverse inference methods, including zero-shot motion tracking, goal reaching, and reward optimization, as well as few-shot optimization-based adaptation. Unlike prior on-policy reinforcement learning (RL) frameworks, BFM-Zero builds upon recent advancements in unsupervised RL and Forward-Backward (FB) models, which offer an objective-centric, explainable, and smooth latent representation of whole-body motions. We further extend BFM-Zero with critical reward shaping, domain randomization, and history-dependent asymmetric learning to bridge the sim-to-real gap. These key design choices are quantitatively ablated in simulation. A first-of-its-kind model, BFM-Zero establishes a step toward scalable, promptable behavioral foundation models for whole-body humanoid control.
An overview of BFM-Zero: After the pre-training stage, BFM-Zero forms a latent space that can be used for zero-shot inference and few-shot adaptation.
Objective: Learn a unified latent representation that embeds tasks (e.g., target motions, rewards, goals) into a shared space and a promptable policy that conditions on this representation to perform diverse tasks without retraining.
BFM-Zero can zero-shot perform tasks including motion tracking, goal reaching, and reward optimization.
BFM-Zero supports few-shot adaptation, meeting specific requirements with minimal additional training.
During pre-training, BFM-Zero aims to learn a latent space $Z$ with elements $z\in Z$, a forward representation $\boldsymbol{F}(s, a, z)$, a backward representation $\boldsymbol{B}(s)$, and a $z$-conditioned policy $\pi_z$ such that:
Input: Target goal pose $s_g$
Input: Target reference motion $\{s_1, ..., s_T\}$
$N$ is the window size for tracking.
Input: Any given reward function $r(s)$
$s_i$ is the $i$-th state in the buffer.
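The three prompt types above can be illustrated with a minimal sketch. The formulas used here are not stated in this text; they follow common Forward-Backward conventions and are assumptions: a goal is embedded as $\boldsymbol{B}(s_g)$, a reference motion as the sum of $\boldsymbol{B}$ over the next $N$ states, and a reward as the buffer average of $r(s_i)\boldsymbol{B}(s_i)$, each rescaled to norm $\sqrt{d}$. The function `backward_embed` is a hypothetical toy stand-in for the learned $\boldsymbol{B}$, and states are simplified to 2-tuples.

```python
import math

def backward_embed(state):
    """Hypothetical toy stand-in for the learned backward map B(s)."""
    x, y = state
    return [math.sin(x), math.cos(y), x * y]

def normalize(z):
    """Rescale z to norm sqrt(d) (an assumed FB convention, not from the text)."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero prompt vector")
    scale = math.sqrt(len(z)) / norm
    return [scale * v for v in z]

def goal_prompt(s_goal):
    """Goal reaching: embed the target pose s_g directly."""
    return normalize(backward_embed(s_goal))

def tracking_prompt(motion, t, window):
    """Motion tracking: sum B over the next `window` reference states after step t."""
    future = motion[t + 1 : t + 1 + window]
    z = [0.0] * len(backward_embed(motion[0]))
    for s in future:
        for k, v in enumerate(backward_embed(s)):
            z[k] += v
    return normalize(z)

def reward_prompt(buffer_states, reward_fn):
    """Reward optimization: reward-weighted average of B over buffer states s_i."""
    z = [0.0] * len(backward_embed(buffer_states[0]))
    for s in buffer_states:
        r = reward_fn(s)
        for k, v in enumerate(backward_embed(s)):
            z[k] += r * v / len(buffer_states)
    return normalize(z)
```

In this sketch no gradient step is involved at inference time: each prompt is a closed-form function of $\boldsymbol{B}$, which is what makes the three tasks zero-shot.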
Remark: Compared to other frameworks, we don't give the model any specific (task-related) reward during training, i.e., it is an unsupervised RL problem. Moreover, the learned representations $\boldsymbol{F}$ and $\boldsymbol{B}$ and the latent space $Z$ are aware of humanoid dynamics.
The well-regularized, dynamics-aware latent space also enables natural and smooth transitions during goal reaching, natural and gentle recovery in motion tracking and disturbance rejection, zero-shot reward optimization (for any given reward at test time), and efficient few-shot adaptation.
BFM-Zero enables a natural and smooth transition from the ground to T-Pose.
A brief running-like adjustment occurs before achieving full stability.
Smooth and natural transitions to Hands-on-hips posture.
Rapid stabilization occurs when the initial stand-up is unsteady.
Successfully recovers even after the first failed attempt to stand.
If the initial pose is not natural, it prioritizes standing up quickly before fine adjustments.
In addition, BFM-Zero maintains a high rate of efficient goal reaching.
Robust even under severe wrist breakage.
We enable the robot to perform basic locomotion tasks, including standing still, walking forward/backward/sideways, and turning left/right.
Maintains stable standing posture without movement.
Forward walking at 0.7 m/s
Sideways movement to the left at 0.3 m/s
Backward walking at 0.3 m/s
Sideways movement to the right at 0.3 m/s
Anticlockwise turning at 5.0 rad/s
Clockwise turning at 5.0 rad/s
Lower the arm (low) or raise the arm (high)
By sampling different sub-buffers from the replay buffer, we can find different behaviors even with the same reward function.
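The effect of sub-buffer sampling can be sketched as follows. This is a toy illustration under stated assumptions, not the project's implementation: `backward_embed` is a hypothetical stand-in for the learned $\boldsymbol{B}$, the prompt is taken to be the reward-weighted average of $\boldsymbol{B}(s_i)$ over the sampled states, and the buffer and reward are made up for the example.

```python
import math
import random

def backward_embed(state):
    """Hypothetical toy stand-in for the learned backward map B(s)."""
    x, y = state
    return [math.sin(x), math.cos(y), x * y]

def reward_prompt(states, reward_fn):
    """Reward-weighted average of B(s_i) over the given states (assumed formula)."""
    z = [0.0] * len(backward_embed(states[0]))
    for s in states:
        r = reward_fn(s)
        for k, v in enumerate(backward_embed(s)):
            z[k] += r * v / len(states)
    return z

def prompts_from_sub_buffers(buffer, reward_fn, num_prompts, sub_size, seed=0):
    """Draw several random sub-buffers and infer one prompt z from each."""
    rng = random.Random(seed)
    return [
        reward_prompt(rng.sample(buffer, sub_size), reward_fn)
        for _ in range(num_prompts)
    ]

# Made-up replay buffer and reward, for illustration only.
buffer = [(0.1 * i, 0.05 * i) for i in range(200)]
reward = lambda s: math.exp(-((s[0] - 1.0) ** 2))
candidates = prompts_from_sub_buffers(buffer, reward, num_prompts=3, sub_size=20)
```

Each entry of `candidates` is a different estimate of the prompt for the same reward, so conditioning the policy on each one may produce a visibly different behavior.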
Observation:
Taking the arm control and basic locomotion as examples, we can combine them to form new skills.
Here, $w$ is the weight of the corresponding reward function. We abbreviate low as "l" and high as "h"; for example, "arm-l-h" means the right wrist is low and the left wrist is high.
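One way to picture this combination is a weighted sum of an arm reward and a locomotion reward. The sketch below is illustrative only: the state keys (`right_wrist_z`, `left_wrist_z`, `base_vx`), the target heights, and the Gaussian-shaped reward terms are all hypothetical, and the soft constraints and regularization terms used at inference time are omitted.

```python
import math

# Hypothetical target wrist heights in metres; not taken from the source.
LOW_HEIGHT, HIGH_HEIGHT = 0.7, 1.2

def wrist_reward(height, level):
    """Close to 1 when the wrist height matches the 'l' or 'h' target."""
    target = LOW_HEIGHT if level == "l" else HIGH_HEIGHT
    return math.exp(-((height - target) ** 2) / 0.02)

def arm_reward(state, code):
    """`code` such as 'l-h': right wrist level first, then left wrist level."""
    right_level, left_level = code.split("-")
    return (wrist_reward(state["right_wrist_z"], right_level)
            * wrist_reward(state["left_wrist_z"], left_level))

def forward_reward(state, target_speed=0.7):
    """Close to 1 when the base moves forward at the target speed (m/s)."""
    return math.exp(-((state["base_vx"] - target_speed) ** 2) / 0.1)

def combined_reward(state, w_arm, w_loco, arm_code):
    """Weighted sum of the arm and locomotion rewards."""
    return w_arm * arm_reward(state, arm_code) + w_loco * forward_reward(state)

# Right wrist low, left wrist high, walking forward at 0.7 m/s.
state = {"right_wrist_z": 0.7, "left_wrist_z": 1.2, "base_vx": 0.7}
value = combined_reward(state, w_arm=0.5, w_loco=0.5, arm_code="l-h")
```

The combined function can then be passed to the reward-based prompt inference like any other $r(s)$, with the weights trading off arm posture against locomotion.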
Note: All demos come from a single continuous video take with the same policy.
Note: Right/Left is relative to the robot. The reward functions are shown for illustration purposes; at inference time, we also apply some soft constraints and regularization terms.
$$z_{t} := \frac{\sin((1-t)\theta)}{\sin \theta}z_0 + \frac{\sin(t\theta)}{\sin \theta}z_1, \quad \theta := \arccos\left(\langle z_0, z_1 \rangle\right), \ z_0\ne z_1, t\in[0,1].$$
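The interpolation formula above is spherical linear interpolation (slerp) between two latent prompts. A minimal sketch follows, assuming $z_0$ and $z_1$ are unit-norm so that $\langle z_0, z_1 \rangle$ is the cosine of the angle between them; if the latents live on a sphere of another radius, they would need rescaling first. The function name `slerp` and the handling of the degenerate cases are choices made for this example.

```python
import math

def slerp(z0, z1, t):
    """Spherical interpolation between unit vectors z0 and z1 for t in [0, 1]."""
    dot = sum(a * b for a, b in zip(z0, z1))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    theta = math.acos(dot)
    sin_theta = math.sin(theta)
    if sin_theta < 1e-8:
        if dot > 0:
            # z0 and z1 (nearly) coincide: nothing to interpolate.
            return list(z0)
        # Antipodal points: the great-circle path is not unique.
        raise ValueError("slerp is undefined for antipodal vectors")
    w0 = math.sin((1.0 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]

# Halfway between two orthogonal unit vectors.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike straight-line interpolation, every intermediate $z_t$ keeps unit norm, so the policy is always prompted with a latent of the same scale.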