OpenVLA is not a VLM that returns captions or answers questions. It returns robot actions. That single difference changes every deployment decision: model call latency has a hard deadline set by your robot's control frequency, not by user patience. A 10 Hz control loop gives you 100 ms per step. A network round-trip to an external API typically eats 50-200 ms before the model even runs. That math rules out remote inference for any closed-loop robot that needs to react in real time.
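The budget arithmetic above can be sketched in a few lines. The function names and the sample round-trip values are illustrative assumptions, not measurements.

```python
def step_budget_ms(control_hz: float) -> float:
    """Time available per control step, in milliseconds."""
    if control_hz <= 0:
        raise ValueError("control_hz must be positive")
    return 1000.0 / control_hz


def remaining_budget_ms(control_hz: float, network_rtt_ms: float) -> float:
    """Budget left for the model after the network round-trip (may be negative)."""
    return step_budget_ms(control_hz) - network_rtt_ms


# 10 Hz loop with the low and high ends of the 50-200 ms round-trip range
for rtt in (50.0, 200.0):
    left = remaining_budget_ms(10.0, rtt)
    print(f"10 Hz loop, {rtt:.0f} ms round-trip: {left:.0f} ms left for inference")
```

A negative result means the network alone has already blown the step deadline before any inference happens.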

RT-2 (Google) and Pi-0 (Physical Intelligence) are the best-known alternatives, but both are closed-source API-only services. OpenVLA is the only fully open-weight vision-language-action model in the 7B range, released under an MIT license. You can fine-tune it on your robot's proprietary demonstration data, run it on your own hardware with no per-call fees, and modify the action tokenizer if your robot's action space differs from the training distribution.

For context on how general vision-language models differ from action-producing models, see Deploy Vision Language Models on GPU Cloud.

What Is OpenVLA

OpenVLA 7B is built on Prismatic-7B, a vision-language model that uses a dual ViT encoder: SigLIP for high-level semantic features and DinoV2 for spatial detail. The language backbone is a 7B Llama-2-based model. Together, the Prismatic ViT encoder plus the LM decoder produce a model that takes an RGB image plus a natural language instruction and generates a robot action.

The action space is 7-DoF: x, y, z translation; roll, pitch, yaw rotation; and gripper open/close. Each continuous value in the action vector is discretized into one of 256 bins, and each bin maps to a token ID. Rather than extending the vocabulary, OpenVLA overwrites the 256 least-used tokens in the Llama tokenizer vocabulary with these action bin tokens. A single action step produces 7 output tokens, one per dimension. Because the model ships as a custom model class and processor defined in the model repository, loading it requires --trust-remote-code in vLLM or trust_remote_code=True in Hugging Face.
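The binning step can be illustrated with a simplified sketch. It assumes actions are already normalized to [-1, 1] and uses plain uniform bins with bin-center decoding. The function names, the sample action, and the choice to place the bins in the last 256 vocabulary slots are assumptions for illustration, not the OpenVLA tokenizer's actual code.

```python
import numpy as np

N_BINS = 256        # bins per action dimension, as described above
VOCAB_SIZE = 32000  # Llama-2 tokenizer vocabulary size


def discretize(action: np.ndarray, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Map each normalized action value to an integer bin index in [0, N_BINS - 1]."""
    clipped = np.clip(action, low, high)
    scaled = (clipped - low) / (high - low)  # now in [0, 1]
    return np.minimum((scaled * N_BINS).astype(np.int64), N_BINS - 1)


def undiscretize(bins: np.ndarray, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Recover the bin-center value for each bin index."""
    centers = (bins.astype(np.float64) + 0.5) / N_BINS  # bin centers in [0, 1]
    return low + centers * (high - low)


def bins_to_token_ids(bins: np.ndarray) -> np.ndarray:
    """Illustrative mapping: put the 256 bins in the last 256 vocabulary slots."""
    return VOCAB_SIZE - N_BINS + bins


action = np.array([0.02, -0.15, 0.40, 0.0, 0.1, -1.0, 1.0])  # made-up 7-DoF action
bins = discretize(action)
token_ids = bins_to_token_ids(bins)
recovered = undiscretize(bins)
print("bins:", bins)
print("max round-trip error:", np.abs(recovered - action).max())
```

With 256 bins over a range of width 2, the worst-case quantization error is half a bin width, or 1/256 in normalized units.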

Property	Value
Parameters	~8B
Base model	Prismatic-7B (SigLIP + DinoV2 + Llama-2 7B backbone)
Action space	7-DoF delta actions (x, y, z, roll, pitch, yaw, gripper)
Action tokens	256 discrete bins per dimension, 7 tokens per step
Context length	1024 tokens (image tokens + instruction + action)
Training data	Open X-Embodiment: ~970k curated episodes, 22 embodiments
License	MIT

Training used a curated subset of the Open X-Embodiment (OXE) dataset, which aggregates demonstrations from 970,000+ episodes across 22 robot embodiments. The mix spans tabletop manipulation, mobile manipulation, and navigation, including both simulated and real-robot data.

Why Self-Host Instead of Using an API

Closed-loop latency. A 10 Hz control loop gives 100 ms per step. Cloud API round-trips typically add 50-200 ms of network latency before the model runs, which consumes half of that budget at best and all of it at worst. A self-hosted H100 can return an action in under 150 ms including image preprocessing and action de-tokenization, keeping the third-party API hop out of the critical path. That fits a 5 Hz loop; for sub-100 ms loops, OpenVLA-OFT (the parallel-decoding follow-up) is worth evaluating because it removes the sequential autoregressive bottleneck.

Data residency. Live sensor streams and proprietary demonstration data cannot go to an external API in defense robotics, medical robotics, and any context where the robot's observation data is commercially sensitive. Your demonstration data represents months of operator time; it is a competitive asset. Running inference on-premise or in a private cloud instance means that data never leaves your infrastructure.

Fine-tuning on proprietary embodiments. RT-2 and Pi-0 have no public fine-tuning API. If your robot has a different arm configuration, gripper type, or observation setup from the training distribution, you are stuck with the base model's generalization. OpenVLA's LoRA fine-tuning workflow lets you adapt the model to a new embodiment in hours on a single H100. See the GRPO fine-tuning guide for GPU memory math that also applies to LoRA VLA training.

GPU Sizing for OpenVLA Inference

OpenVLA 7B in BF16 occupies approximately 14-15 GB of VRAM for weights. The practical minimum for production serving with KV cache and visual encoder workspace is an A100 40GB: the weights fit in under half the card's memory, leaving 25+ GB for the ViT encoder intermediates and action token KV cache.
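The weight-memory figure follows from parameter count times bytes per parameter. A minimal sketch, assuming a parameter count between 7.0B and 7.5B (the exact count is an assumption here) and counting weights only:

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


BF16_BYTES = 2  # bytes per parameter in BF16

low_gb = weight_memory_gb(7.0e9, BF16_BYTES)
high_gb = weight_memory_gb(7.5e9, BF16_BYTES)
headroom_gb = 40.0 - high_gb  # what is left on a 40 GB card

print(f"weights: {low_gb:.0f}-{high_gb:.0f} GB, headroom on 40 GB card: {headroom_gb:.0f} GB")
```

The headroom number covers KV cache, ViT activations, and framework overhead, so treat it as an upper bound on usable cache space.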

For closed-loop control, the H100's memory bandwidth advantage (3.35 TB/s vs the A100's 2 TB/s) translates directly to faster per-step decode. The action is only 7 tokens, but decode throughput at batch size 1 is almost entirely memory-bandwidth-bound.
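A rough way to see the bandwidth effect: at batch size 1, each decoded token has to stream the full weight set from memory, so weight bytes divided by memory bandwidth gives a lower bound on per-token time. The sketch below uses the 15 GB weight figure and the bandwidth numbers quoted above. It ignores the ViT pass, prefill, and kernel overheads, so real latencies are higher.

```python
def decode_lower_bound_ms(weight_bytes: float, bandwidth_bytes_per_s: float,
                          n_tokens: int = 7) -> float:
    """Bandwidth-only lower bound on decoding n_tokens at batch size 1."""
    per_token_s = weight_bytes / bandwidth_bytes_per_s
    return per_token_s * n_tokens * 1000.0


WEIGHT_BYTES = 15e9  # ~15 GB of BF16 weights

h100_ms = decode_lower_bound_ms(WEIGHT_BYTES, 3.35e12)  # H100: 3.35 TB/s
a100_ms = decode_lower_bound_ms(WEIGHT_BYTES, 2.0e12)   # A100: 2 TB/s

print(f"H100 lower bound: {h100_ms:.1f} ms, A100 lower bound: {a100_ms:.1f} ms")
```

The ratio between the two bounds matches the bandwidth ratio, which is why the H100 advantage shows up even for a 7-token output.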

GPU	VRAM	Precision	Est. Latency (per step)	On-Demand $/hr	Best For
H100 SXM5 80GB	80 GB	BF16	~100-150 ms	$3.10	Real-time control (<150 ms), multi-robot fleets
H100 SXM5 80GB	80 GB	FP8	~80 ms	$3.10	Highest throughput, fleet scale
A100 80GB SXM4	80 GB	BF16	~150 ms	$1.64	Pick-and-place, 5-10 Hz loops
L40S 48GB	48 GB	FP8	~200 ms	$0.72	Cost-sensitive, 2-5 Hz loops

Latency estimates are based on H100 and A100 memory bandwidth at batch size 1 for an ~8B parameter model generating 7 tokens. Actual numbers depend on your image resolution and preprocessing pipeline. The OpenVLA paper measured ~200 ms per step on a single A100 for autoregressive decoding; H100's higher memory bandwidth brings this down, but sub-100 ms reliably requires OpenVLA-OFT's parallel decoding approach. L40S supports FP8 via the --quantization fp8 flag in vLLM (Ada Lovelace architecture).

Pricing fluctuates based on GPU availability. The prices above are as of 02 May 2026 and may have changed. Check current GPU pricing for live rates.

For H100 instances, see the H100 rental page. For A100 instances, see the A100 rental page. Both are available on-demand with provisioning in under 90 seconds.

Inference Setup with vLLM

Note: vLLM's OpenVLA support is experimental. OpenVLA is not on vLLM's official supported model list as of this writing (see tracking issue vllm-project/vllm#14739). The --trust-remote-code flag enables the custom architecture, but results can vary by vLLM version because OpenVLA's fused SigLIP+DinoV2 visual encoder is not a standard vLLM-supported component. For production workloads, the officially documented path is the HuggingFace predict_action() API. The vLLM path below is useful for teams that need concurrent multi-robot requests and are willing to validate on their specific vLLM version.

vLLM treats OpenVLA as a causal language model whose vocabulary happens to contain action tokens. The 256 action tokens occupy existing slots in the model's vocabulary, and vLLM generates them as ordinary token IDs. The conversion from token IDs to a continuous action vector happens on the client side, so vLLM does not need to know that some tokens represent actions rather than words.

Install dependencies:

bash

pip install "vllm>=0.6.0" "transformers>=4.40"
pip install git+https://github.com/openvla/openvla.git

Download weights:

bash

huggingface-cli download openvla/openvla-7b \
 --local-dir /data/models/openvla-7b

Verify the repo name on Hugging Face before downloading. Model repository naming can change between releases.

Launch the vLLM server (BF16, single H100 or A100 80GB):

bash

vllm serve /data/models/openvla-7b \
 --dtype bfloat16 \
 --max-model-len 1024 \
 --served-model-name openvla \
 --trust-remote-code \
 --port 8000

--trust-remote-code is required. OpenVLA uses a custom model class, defined in the model repository rather than in vLLM, with action tokens mapped onto existing vocabulary slots. Without this flag, vLLM will reject the model config.

Launch the vLLM server (FP8, H100 or L40S):

bash

vllm serve /data/models/openvla-7b \
 --dtype bfloat16 \
 --quantization fp8 \
 --max-model-len 1024 \
 --served-model-name openvla \
 --trust-remote-code \
 --port 8000

Python client with action de-tokenization:

python

import base64
import io

import numpy as np
from openai import OpenAI
from PIL import Image
from transformers import AutoProcessor

# Load the OpenVLA processor for action de-tokenization
processor = AutoProcessor.from_pretrained(
    "/data/models/openvla-7b",
    trust_remote_code=True,
)

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")


def get_action(image: Image.Image, instruction: str) -> np.ndarray:
    # Encode the observation image as a base64 JPEG (JPEG needs RGB mode)
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG")
    img_b64 = base64.b64encode(buf.getvalue()).decode()

    response = client.chat.completions.create(
        model="openvla",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"},
                },
                {
                    "type": "text",
                    "text": f"What action should the robot take to {instruction}?",
                },
            ],
        }],
        max_tokens=7,  # one token per action dimension
        temperature=0.0,
    )

    # vLLM returns the generated tokens as text.
    # For direct HuggingFace usage (not vLLM), the documented API is:
    #   action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    # For the vLLM path, verify the exact decode method against the current
    # OpenVLA repo (https://github.com/openvla/openvla) before using in production;
    # the decode_actions call below is a placeholder for that step.
    if not response.choices:
        raise RuntimeError("vLLM returned no completions; check server logs for errors")
    generated_text = response.choices[0].message.content
    action = processor.decode_actions(generated_text)
    return action  # shape: (7,) with [x, y, z, roll, pitch, yaw, gripper]

The --max-model-len 1024 setting is appropriate because OpenVLA observations are short: one image tokenizes to a few hundred visual tokens, and the instruction plus action generation adds very little additional context. Keeping max-model-len small reduces KV cache pre-allocation and lets you fit more concurrent requests into available VRAM.
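To put a number on the KV cache saving, the sketch below estimates per-request cache size. It assumes the standard Llama-2 7B attention layout (32 layers, 32 KV heads, head dimension 128) with BF16 cache values, and that OpenVLA's backbone keeps that layout unchanged. vLLM's actual allocation is block-based and will differ somewhat.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 32, n_kv_heads: int = 32,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes of key and value cache for one sequence of n_tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return n_tokens * per_token


per_request = kv_cache_bytes(1024)
print(f"KV cache at 1024 tokens: {per_request / 2**20:.0f} MiB per request")
```

Under these assumptions a full 1024-token sequence costs about 512 MiB of cache, so an 80 GB card with roughly 15 GB of weights has room for many concurrent robot requests.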

TensorRT-LLM Engine Build (Optional, for Sub-100 ms)

The Prismatic ViT encoder is the latency bottleneck at batch size 1. The language model decoder generates only 7 tokens per step, so it finishes quickly. The ViT forward pass, which processes the observation image into visual embeddings, is what determines whether you hit the 80 ms or 150 ms mark.

Building a TensorRT-LLM engine for the Prismatic ViT encoder compiles it to a fixed-shape CUDA kernel tuned for your specific input resolution. At batch size 1, this typically runs 1.5-2x faster than the PyTorch reference implementation. The general workflow is:

bash

# Install TensorRT-LLM
pip install tensorrt-llm

# Export the ViT encoder to ONNX, then build with trtllm-build
# Exact flags depend on your input resolution and GPU generation
trtllm-build \
 --model_dir /data/models/openvla-7b/vision_encoder \
 --output_dir /data/engines/openvla-vit \
 --dtype bfloat16

This section describes the general TRT-LLM approach. Specific trtllm-build flags for the Prismatic ViT encoder change between OpenVLA releases. Check the OpenVLA GitHub issues for current TRT-LLM compatibility status before investing build time.

Use this path only if the vLLM serving approach exceeds your latency target. The vLLM path is simpler, maintains compatibility with new releases, and works well for control loops running at 5 Hz or slower.

Production Latency Tuning

Image Preprocessing Pipeline

OpenCV resize and normalization on the robot controller should run in a background thread. The goal is to have the preprocessed tensor ready before the previous action chunk finishes executing, so the GPU call starts immediately at the end of execution rather than waiting for CPU preprocessing.

A typical pipeline: when the controller starts executing action chunk N, it submits the current observation image to the preprocessing thread. By the time the last action in chunk N executes, the preprocessed tensor for chunk N+1 is ready to send. This overlaps CPU preprocessing with robot motion and eliminates preprocessing stalls from the GPU call latency.
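The overlap described above can be sketched with a single background worker. The preprocess and execute_chunk functions are stand-ins (a real pipeline would call OpenCV and the robot controller), and all names here are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def preprocess(frame: np.ndarray) -> np.ndarray:
    """Stand-in for resize + normalization: scale uint8 pixels to float32 in [0, 1]."""
    return frame.astype(np.float32) / 255.0


def execute_chunk(chunk, step_s: float) -> None:
    """Stand-in for sending each action to the robot at a fixed rate."""
    for _ in chunk:
        time.sleep(step_s)


def run_cycle(executor: ThreadPoolExecutor, frame: np.ndarray, chunk, step_s: float = 0.001):
    # Start preprocessing the next observation as soon as execution begins,
    # so it runs in the background while the robot moves.
    future = executor.submit(preprocess, frame)
    execute_chunk(chunk, step_s)
    # Normally finished by now; result() only blocks if preprocessing is slower
    # than chunk execution.
    return future.result()


with ThreadPoolExecutor(max_workers=1) as pool:
    frame = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder camera frame
    tensor = run_cycle(pool, frame, chunk=[None] * 8)
    print(tensor.shape, tensor.dtype)
```

If result() ever blocks for a noticeable time, preprocessing is slower than chunk execution and the chunk size or preprocessing cost needs another look.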

Action Chunking

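Instead of calling the model once per control step, request a chunk of 8 to 16 actions in a single model call. Your robot controller executes the chunk while the GPU decodes the next one. With autoregressive decoding each action still costs 7 sequential tokens, so the saving comes from amortizing the request, image encoding, and prefill overhead across the chunk. The base checkpoint is trained to emit one 7-token action per observation, so validate multi-action output on your fine-tuned model before relying on it.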

The right chunk size depends on two factors: your controller's replanning tolerance (how quickly you need to react to unexpected obstacles or trajectory deviations), and your control frequency. A 10 Hz controller with 8-action chunks replans every 800 ms. A 5 Hz controller with 4-action chunks replans every 800 ms as well. Longer chunks reduce replanning frequency and amortize model call overhead, but increase tracking error on curved paths because the model does not see intermediate observations.

A practical starting point: set chunk size so that the execution time for one chunk equals roughly 1.5x the model call latency. This gives the GPU time to finish decoding the next chunk before the controller needs it.
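The 1.5x rule of thumb and the replanning intervals above reduce to two small formulas. This is a sketch with hypothetical function names. It returns the smallest chunk that satisfies the rule, which you can then raise toward your replanning tolerance.

```python
import math


def chunk_size_for(latency_ms: float, control_hz: float, safety_factor: float = 1.5) -> int:
    """Smallest chunk whose execution time is at least safety_factor x model latency."""
    step_ms = 1000.0 / control_hz
    return max(1, math.ceil(safety_factor * latency_ms / step_ms))


def replan_interval_ms(chunk_size: int, control_hz: float) -> float:
    """How long the robot runs open-loop on one chunk."""
    return chunk_size * 1000.0 / control_hz


print(chunk_size_for(latency_ms=150, control_hz=10))  # 150 ms model call, 10 Hz loop
print(replan_interval_ms(8, 10), replan_interval_ms(4, 5))
```

For a 150 ms model call at 10 Hz the rule gives a minimum of 3 actions per chunk. The two replanning intervals printed on the last line correspond to the 8-action and 4-action examples in the text.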

Control Loop Integration

The producer-consumer pattern works well for overlapping GPU inference with robot execution. The GPU inference thread pulls an observation from the queue, calls the model, and pushes an action chunk to the output queue. The robot execution thread pulls chunks from the output queue and sends commands to the controller at the target frequency.

python

import asyncio
import base64
import io
import queue

from PIL import Image

# `client` and `processor` are the OpenAI client and OpenVLA processor
# created in the Python client example above.

CONTROL_FREQUENCY_HZ = 10  # target control loop frequency

observation_queue = queue.Queue(maxsize=2)
action_queue = queue.Queue(maxsize=2)


def get_action_chunk(image: Image.Image, instruction: str, chunk_size: int = 8) -> list:
    # Encode the observation image as a base64 JPEG (JPEG needs RGB mode)
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG")
    img_b64 = base64.b64encode(buf.getvalue()).decode()

    response = client.chat.completions.create(
        model="openvla",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"},
                },
                {
                    "type": "text",
                    "text": f"What action should the robot take to {instruction}?",
                },
            ],
        }],
        max_tokens=7 * chunk_size,  # 7 action dimensions per step
        temperature=0.0,
    )

    if not response.choices:
        raise RuntimeError("vLLM returned no completions; check server logs for errors")
    generated_text = response.choices[0].message.content
    # decode_actions is assumed to return a flat array of shape (7 * chunk_size,)
    all_actions = processor.decode_actions(generated_text)
    return [all_actions[i * 7:(i + 1) * 7] for i in range(chunk_size)]


async def inference_worker():
    loop = asyncio.get_running_loop()
    while True:
        # Blocking queue and model calls run in the default thread pool
        obs, instruction = await loop.run_in_executor(None, observation_queue.get)
        action_chunk = await loop.run_in_executor(
            None, get_action_chunk, obs, instruction, 8
        )
        await loop.run_in_executor(None, action_queue.put, action_chunk)


async def execution_worker(robot_controller):
    loop = asyncio.get_running_loop()
    while True:
        chunk = await loop.run_in_executor(None, action_queue.get)
        for action in chunk:
            robot_controller.send(action)
            await asyncio.sleep(1.0 / CONTROL_FREQUENCY_HZ)

For a broader look at prefill-decode disaggregation patterns that can further reduce tail latency in multi-robot serving, see Prefill-Decode Disaggregation on GPU Cloud.

Fine-Tuning OpenVLA on a Custom Embodiment

Data Preparation

Record demonstrations as (RGB observation image, natural language instruction, action vector) triples. Each step in a demonstration is one training example. The action vector is a float32 array of 7 values in your robot's action space.

If you do not have enough real robot demonstrations, Genesis physics engine simulation can generate thousands of synthetic trajectories in LeRobot v2 format in minutes - enough to bootstrap a usable OpenVLA adapter before collecting real data.

OpenVLA's normalization scripts convert your action vectors to the 256-bin discrete token format the model expects. The normalization is per-dimension and computed from statistics across your training dataset. Run normalize_actions.py from the official OpenVLA repo to generate the normalization statistics before training.

bash

# Generate action normalization statistics from your dataset
python normalize_actions.py \
 --dataset_path /data/demos/my_robot \
 --output_path /data/demos/my_robot/action_stats.json
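For intuition about what the statistics are used for, here is a sketch of per-dimension normalization. It is not the contents of normalize_actions.py. It assumes the 1st and 99th percentile bounds described in the OpenVLA paper, and the function names and synthetic data are hypothetical.

```python
import numpy as np


def compute_action_stats(actions: np.ndarray) -> dict:
    """Per-dimension 1st and 99th percentiles over an (N, 7) array of actions."""
    return {
        "q01": np.percentile(actions, 1, axis=0),
        "q99": np.percentile(actions, 99, axis=0),
    }


def normalize(actions: np.ndarray, stats: dict) -> np.ndarray:
    """Scale each dimension so [q01, q99] maps to [-1, 1], clipping outliers."""
    span = np.maximum(stats["q99"] - stats["q01"], 1e-8)  # avoid divide-by-zero
    scaled = 2.0 * (actions - stats["q01"]) / span - 1.0
    return np.clip(scaled, -1.0, 1.0)


def unnormalize(normalized: np.ndarray, stats: dict) -> np.ndarray:
    """Inverse of normalize for values that were not clipped."""
    span = stats["q99"] - stats["q01"]
    return 0.5 * (normalized + 1.0) * span + stats["q01"]


rng = np.random.default_rng(0)
demo_actions = rng.normal(scale=0.05, size=(1000, 7))  # synthetic stand-in data
stats = compute_action_stats(demo_actions)
normalized = normalize(demo_actions, stats)
print(normalized.min(), normalized.max())
```

Percentile bounds keep a few outlier actions from stretching the range and wasting bins. The cost is that values outside the bounds are clipped and cannot be recovered exactly.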

LoRA Setup

LoRA on the language backbone is the practical path for cross-embodiment fine-tuning. The recommended configuration:

python

from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    # Llama attention projections; the output projection is named o_proj
    target_modules=["q_proj", "v_proj", "o_proj"],  # LM backbone only
    bias="none",
    task_type="CAUSAL_LM",
)

Apply LoRA only to the LM backbone, not the visual encoder. The ViT encoder handles observation features that transfer well across embodiments. The LM backbone is where the embodiment-specific action mapping lives. Applying LoRA to the ViT can hurt cross-embodiment transfer if your dataset is small.
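To get a feel for how small the adapter is, the sketch below counts trainable LoRA parameters. It assumes rank 32 adapters on three 4096 x 4096 attention projections in each of 32 decoder layers, which matches a Llama-2 7B backbone. The helper name is hypothetical.

```python
def lora_params_per_module(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) to each adapted weight."""
    return rank * (d_in + d_out)


HIDDEN = 4096          # Llama-2 7B hidden size
N_LAYERS = 32          # decoder layers
MODULES_PER_LAYER = 3  # adapted attention projections per layer
RANK = 32

total = lora_params_per_module(HIDDEN, HIDDEN, RANK) * MODULES_PER_LAYER * N_LAYERS
print(f"trainable LoRA parameters: {total:,} (~{total * 2 / 1e6:.0f} MB in BF16)")
```

Under these assumptions the adapter has about 25 million trainable parameters, well under 1% of the base model, which is why checkpoints are cheap to store and swap per embodiment.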

Training Command

bash

python finetune.py \
 --model_path /data/models/openvla-7b \
 --dataset_path /data/demos/my_robot \
 --action_stats_path /data/demos/my_robot/action_stats.json \
 --use_lora True \
 --lora_rank 32 \
 --lora_alpha 64 \
 --batch_size 16 \
 --gradient_accumulation_steps 4 \
 --learning_rate 2e-4 \
 --num_steps 10000 \
 --save_every 500 \
 --output_dir /data/checkpoints/my_robot_lora

GPU-Hour Budget

Dataset Size	GPU	Estimated Time	Estimated Cost
1,000 demos	H100 SXM5 80GB	~1.5 hrs	~$4.65
10,000 demos	H100 SXM5 80GB	~6 hrs	~$18.60
50,000 demos	H100 SXM5 80GB	~28 hrs	~$86.80

Cost is based on $3.10/hr on-demand H100 SXM5 pricing. Demo count assumes each demonstration is roughly 100-200 steps at 10 Hz, which is typical for tabletop manipulation tasks.
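The cost column is hours multiplied by the hourly rate. A small sketch for re-running the numbers when the rate changes (the function name is hypothetical; the hours are the estimates from the table):

```python
H100_PRICE_PER_HR = 3.10  # on-demand rate quoted above


def run_cost(hours: float, price_per_hr: float = H100_PRICE_PER_HR) -> float:
    """Estimated cost of a training run, rounded to cents."""
    return round(hours * price_per_hr, 2)


for demos, hours in ((1_000, 1.5), (10_000, 6), (50_000, 28)):
    print(f"{demos:>6} demos: ~{hours} hrs -> ~${run_cost(hours):.2f}")
```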

For budget fine-tuning runs, the L40S on Spheron is a cost-effective option for small datasets (under 5,000 demos) at FP8 quantization.

Evaluation with LIBERO

The LIBERO simulation benchmark, used in the original OpenVLA paper, provides a standard evaluation framework for tabletop manipulation tasks across four task suites: spatial, object, goal, and long-horizon. Run LIBERO evaluations on your fine-tuned adapter before deploying to real hardware to confirm that task success rate improves over the base checkpoint.

A minimal evaluation run:

bash

# Install LIBERO
pip install libero

# Evaluate base checkpoint
python eval_libero.py \
 --model_path /data/models/openvla-7b \
 --suite libero_spatial \
 --num_trials 50

# Evaluate fine-tuned checkpoint
python eval_libero.py \
 --model_path /data/models/openvla-7b \
 --lora_path /data/checkpoints/my_robot_lora \
 --suite libero_spatial \
 --num_trials 50

If LIBERO task success rate does not improve after fine-tuning, check your action normalization statistics and confirm your dataset covers the full range of the task's object positions and configurations.

For related fine-tuning pipelines, see GRPO fine-tuning on GPU Cloud for reasoning-oriented RL-based approaches and DPO fine-tuning on GPU Cloud for preference-based refinement after an initial LoRA step. For large-scale multi-node fine-tuning where a single H100 is not enough, RLinf's VLA training pipeline disaggregates rollout workers from the trainer and scales OpenVLA fine-tuning across 8 to 256 GPUs without rewriting algorithm code.

Deployment Patterns

Edge Robot with Cloud GPU

The robot sends compressed observation images over a low-latency WAN connection. The cloud GPU instance runs OpenVLA and returns action tokens. The robot de-tokenizes and executes.

This works for 5 Hz control loops with a reliable network hop under 20 ms each way. Pair with action chunking to buffer against variable network latency. It fails at 10 Hz or on unreliable connections, because a single dropped packet or 50 ms network spike puts the controller behind schedule.

Best for: mobile robots or manipulators with a 5 Hz target control frequency, located on the same campus or data center network as the GPU instance.

Hybrid Inference Split

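Run the Prismatic ViT encoder on the robot's onboard GPU (an RTX 4090 or similar) and send only the resulting visual embeddings to the cloud for the LM decoder step. Size the payload before committing to this split. A raw 224x224 RGB frame is roughly 150 KB uncompressed (224 x 224 x 3 bytes). The embeddings the LM consumes are one vector per image patch: 256 patch tokens at the LM's 4096-wide hidden size come to about 2 MB in BF16, which is larger than the image itself. Getting the payload below the image size requires pooling, quantizing, or otherwise compressing the embeddings, and any such scheme needs validation against task success rate.

The benefit of the split is therefore compute placement rather than bandwidth: the ViT forward pass stays on the robot, and the cloud only needs to run the 7B LM decoder, which generates 7 action tokens. On a 1 Gbps LAN a 2 MB payload takes roughly 17 ms to transfer, so budget for it explicitly; over a WAN link, compare it against the 5-50 ms a full image typically takes.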

Best for: robots with a local GPU (RTX 4090, RTX Pro 6000) that need the LM capacity of a 7B model but want to keep the visual processing local.

Fallback to a Lightweight Policy Head

Keep a compact MLP policy on the robot as a fallback for when the cloud call misses the latency SLA. The MLP is trained by behavioral cloning on the same demonstration data. It handles routine, repetitive motions where the 7B model is not needed, and activates only when the cloud call exceeds a latency threshold.

OpenVLA handles novel, ambiguous, or instruction-conditional tasks. The MLP handles the high-frequency portions of motions where the trajectory is already committed. This dual-track setup means a network hiccup does not stop the robot mid-task.
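The timeout-and-fallback logic can be sketched as follows. Both policies are stand-ins (a sleep simulates the remote call, zeros stand in for the MLP output), and every name and threshold here is hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

import numpy as np


def cloud_policy(observation: np.ndarray, delay_s: float) -> np.ndarray:
    """Stand-in for the remote OpenVLA call; delay_s simulates network + inference."""
    time.sleep(delay_s)
    return np.full(7, 0.1)


def local_mlp_policy(observation: np.ndarray) -> np.ndarray:
    """Stand-in for the compact on-robot fallback policy."""
    return np.zeros(7)


def select_action(pool: ThreadPoolExecutor, observation: np.ndarray,
                  delay_s: float, sla_s: float):
    future = pool.submit(cloud_policy, observation, delay_s)
    try:
        return future.result(timeout=sla_s), "cloud"
    except FutureTimeout:
        # The late cloud result is discarded; its thread finishes in the background.
        return local_mlp_policy(observation), "fallback"


with ThreadPoolExecutor(max_workers=2) as pool:
    obs = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder observation
    fast_action, fast_source = select_action(pool, obs, delay_s=0.0, sla_s=0.5)
    slow_action, slow_source = select_action(pool, obs, delay_s=0.3, sla_s=0.05)
    print(fast_source, slow_source)
```

A production version would also need to decide what to do with a cloud action that arrives late, for example dropping it or using it only to seed the next chunk.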

For more on combining cloud and edge inference, see Hybrid Cloud and Edge AI Inference Guide.

OpenVLA vs RT-2 vs Pi-0

Model	Open-Weight?	Fine-Tunable?	Inference Latency	Action Space	Training Data
OpenVLA 7B	Yes (MIT)	Yes (LoRA)	~100-150 ms self-hosted H100	7-DoF delta	Open X-Embodiment (~970k eps)
RT-2 (Google)	No	No	200-600 ms (API)	6-DoF delta	Google internal
Pi-0 (Physical Intelligence)	No	Partial (via API)	~300 ms (API)	Flow matching policy	Physical Intelligence internal

OpenVLA's openness matters when your team has proprietary robot demonstrations, works under data residency constraints, operates in environments without reliable internet, or cannot absorb per-call API fees at scale. A fleet of 20 robots making 10 calls per second is 200 calls per second; API costs add up fast at that frequency.

RT-2 and Pi-0 have an edge in pre-trained generalization. They were trained on far more data and compute. For teams without proprietary embodiment data or latency constraints, the API options are simpler to get started with. The decision usually comes down to whether you need to fine-tune on your own robot, or whether the base model's generalization is sufficient.

Teams running multi-robot fleets at scale often need the same TRT-LLM engine optimizations covered in the TensorRT-LLM Production Deployment Guide.

Robotics teams using OpenVLA need predictable bare-metal latency, not serverless cold starts. Spheron provides on-demand H100 SXM5 instances from $3.10/hr and A100 80GB from $1.64/hr, with provisioning in under 90 seconds and no minimum commitment.

Quick Setup Guide

Choose a GPU tier based on latency target
Pick your GPU based on your closed-loop control budget: H100 80GB for under 150 ms end-to-end (model call + image preprocessing + action execution), A100 80GB for 200-400 ms tolerances (most pick-and-place tasks), L40S 48GB for FP8-quantized serving at 250-500 ms. Identify your robot's control frequency first. A 10 Hz control loop gives you 100 ms per step budget; a 5 Hz loop gives 200 ms. The GPU choice follows directly from that number.
Provision a GPU instance on Spheron
Go to app.spheron.ai, select your GPU model, and deploy. Spheron provisions bare-metal instances in under 90 seconds. SSH in and confirm your CUDA version with nvidia-smi. For OpenVLA, ensure you have at least 50 GB of persistent storage for the 7B model weights plus LoRA adapter checkpoints.
Install dependencies and download OpenVLA weights
Install PyTorch 2.3+, transformers>=4.40, and the openvla package from GitHub (pip install git+https://github.com/openvla/openvla.git). Download weights: huggingface-cli download openvla/openvla-7b --local-dir /data/models/openvla-7b. The full checkpoint is approximately 14 GB.
Run inference with the OpenVLA Python API
Load the model and processor with from_pretrained, pass an observation image and a task language instruction, and decode the predicted action vector. The officially documented API is vla.predict_action(**inputs, unnorm_key='bridge_orig', do_sample=False). It returns a continuous 7-DoF delta action (x, y, z, roll, pitch, yaw, gripper), un-normalized with the dataset statistics selected by unnorm_key.
Serve OpenVLA via vLLM for concurrent robot fleet requests
Launch vLLM with vllm serve /data/models/openvla-7b --dtype bfloat16 --max-model-len 1024 --served-model-name openvla --trust-remote-code --port 8000. Note that vLLM support for OpenVLA is experimental. Post-process generated token IDs through the OpenVLA processor on the client side to recover the continuous action vector. Each robot in your fleet sends its observation image as base64 in the messages array alongside the task instruction text.
Apply action chunking to reduce effective per-step latency
Instead of requesting one action per model call, request a chunk of 8 to 16 actions in a single model call. Your robot controller executes the chunk while the GPU decodes the next one. This hides model call latency behind execution time. Configure the chunk size based on your controller's replanning tolerance, as longer chunks reduce replanning frequency but increase tracking error on paths that change direction.
Fine-tune on your robot embodiment with LoRA
Format your demonstrations as (RGB observation image, natural language instruction, action vector) triples. Use the provided normalization scripts to convert your action space to OpenVLA's 256-bin discrete token format. Run the official fine-tuning script with LoRA rank 32 and alpha 64. A 10,000-demonstration dataset typically requires 4-6 GPU-hours on an H100 80GB. Use the LIBERO benchmark tasks as a baseline to confirm that your fine-tuned adapter improves task success rate over the base checkpoint before deploying to real hardware.

Frequently Asked Questions

OpenVLA 7B at BF16 needs approximately 15 GB of VRAM for weights. A single A100 40GB is the practical minimum for production serving with enough headroom for KV cache and action token generation. For closed-loop control at under 200 ms end-to-end latency, an H100 80GB is recommended because its higher memory bandwidth (3.35 TB/s vs 2 TB/s) cuts the per-step decode time significantly.

Yes. OpenVLA supports LoRA fine-tuning. You format your robot's demonstrations as (image, instruction, action) triples, convert actions to the OpenVLA discrete token format using the provided normalization scripts, and train with the HuggingFace Trainer or TRL. A LoRA run (r=32, alpha=64) on a 10,000-demonstration dataset trains in roughly 4-6 GPU-hours on an H100 80GB. Your proprietary data never leaves your cloud instance.

vLLM handles OpenVLA's action tokens as a text generation task, because OpenVLA tokenizes continuous actions into 256 discrete bins mapped to token IDs in the Llama vocabulary. The action decoding step happens on the client: you post-process vLLM's generated token IDs through the OpenVLA processor to recover the original continuous action vector. For production use, verify the exact decode method name against the current OpenVLA repo; the officially documented non-vLLM path is vla.predict_action(**inputs, unnorm_key='bridge_orig', do_sample=False). The model itself runs as a standard causal LM in vLLM.

RT-2 (Google) and Pi-0 (Physical Intelligence) are closed-source models accessible only via API. OpenVLA is fully open-weight under an MIT license, letting you fine-tune on your own robot demonstrations, inspect and modify the action tokenizer, and run inference on your own hardware with no per-call fees. The tradeoff is that RT-2 and Pi-0 have significantly more training compute and larger policy capacity. For teams with proprietary embodiments or latency requirements that rule out remote inference, OpenVLA is the practical choice.

The main levers are: (1) quantize to FP8 on H100 or INT8 on A100 to cut weight memory and decode time; (2) pre-tokenize the observation image once and cache visual tokens across action chunk steps; (3) use action chunking (decode 8-16 actions at once and execute them while the GPU prepares the next chunk) to amortize the model call overhead; (4) build a TensorRT-LLM engine for the Prismatic ViT encoder specifically, which is typically the slowest component at low batch sizes.

OpenVLA 7B was fine-tuned from Prismatic-7B on a curated mix of roughly 970,000 episodes from the Open X-Embodiment (OXE) dataset, covering 22 robot embodiments. The mix spans tabletop manipulation, mobile manipulation, and navigation tasks, and includes both simulated and real-robot demonstrations.

URL: https://www.spheron.network/blog/deploy-openvla-gpu-cloud/

Deploy OpenVLA on GPU Cloud: Self-Host the Open Vision-Language-Action Robotics Foundation Model (2026 Setup Guide) | Spheron Blog
