VOOZH about

URL: https://www.spheron.network/blog/deploy-openhands-gpu-cloud/

⇱ Deploy OpenHands on GPU Cloud: Self-Host the Open-Source AI Software Engineering Agent (2026 Guide) | Spheron Blog


OpenHands consistently places at the top of the SWE-bench Verified leaderboard with open-weight models, and it's free to self-host under the MIT license. Devin charges per task on a subscription tier that stops making economic sense above a few hundred tasks per month. OpenHands on GPU cloud breaks that constraint.

If you're running an autonomous agent that writes, edits, and tests real code, you want the model backend under your control and the per-task cost predictable. This guide covers exactly that: the two-node architecture, model selection for SWE-bench-class performance, vLLM deployment on Spheron, and the security setup that makes Docker-sandboxed code execution safe in production.

Before the deployment steps: if you want Devstral 24B on GPU cloud as a standalone coding assistant (not the full autonomous agent loop), that guide covers single-GPU vLLM setup for IDE integration. If you're looking for self-hosted IDE autocomplete tools rather than an agent that autonomously completes tasks, that covers Continue, Aider, and Tabby. OpenHands is different from both: it's an agent loop that reads repos, edits files, runs tests, and iterates until a task passes. For the AI agent code execution sandboxes that sit underneath this kind of agent, including Firecracker and E2B, see that guide. And if you're evaluating OpenHands deployments against benchmarks, the SWE-bench evaluation infrastructure guide covers running the full harness on GPU cloud. If your goal is a general-purpose agentic assistant with 50+ tool integrations rather than code-specific tasks, see the OpenClaw GPU deployment guide which covers that setup on the same two-node architecture.

OpenHands in 2026

OpenHands started as OpenDevin in early 2024, was renamed, and now sits at version 1.7.0 as of May 2026. The core design is an observe-think-act loop: the agent sees a task, calls a tool (read file, run shell command, apply patch), observes the output, decides the next action, and iterates until a termination condition is met.

The runtime architecture has two components. First, the controller process: a Python server that manages the agent loop, handles the LLM abstraction via LiteLLM, and coordinates sandbox lifecycle. Second, the sandbox: a Docker container that the controller spawns for each task. The agent's code runs inside that sandbox, isolated from the controller host. Shell commands, file writes, and test runs all happen inside the sandbox container. The controller communicates with the sandbox over a socket.

Recent releases added headless mode for programmatic task submission via REST API, multi-agent support where one controller can spawn subagents for parallel subtasks, and a MicroAgent system for specialized skills. Headless mode is what makes batch processing at scale viable without a human clicking through the web UI.

On SWE-bench Verified (500 real GitHub issues), OpenHands with Claude Opus 4.6 scores 68.4%. With Devstral 24B as the LLM backend, it scores 46.8%. With Qwen3-235B-A22B MoE, estimates place it above 52%. Devin 2.0's publicly reported score is 45.8%, which open-weight OpenHands with Devstral already matches.

Architecture Overview

The production setup runs two nodes:

NodeHardwareRole
Inference nodeH100 SXM5 80GB (or H200)vLLM serves the LLM over HTTP
Controller node8-16 vCPU CPU instanceOpenHands app + Docker socket for sandbox management

Communication flows: the controller sends LLM requests to http://<inference-node-ip>:8000/v1 (OpenAI-compatible). The controller also mounts the host Docker socket (/var/run/docker.sock) to spawn and manage sandbox containers. The sandbox containers run on the same host as the controller.

The two-node split is not strictly required at low scale. For development or small teams, you can run vLLM and the OpenHands controller on the same GPU instance. Separating them makes sense when you want to scale inference independently from the controller, or when you want GPU cost to scale with model load rather than with the number of agent tasks in flight.

Model Selection

Your choice of LLM backend determines VRAM requirements, benchmark performance, and per-task cost. Here are the practical options for self-hosted OpenHands:

ModelVRAM (BF16)SWE-bench VerifiedSingle GPUNotes
Devstral 24B~50 GB46.8%A100 80GB / H100Best coding specialist per dollar
Qwen3-32B~65 GBEst. 45-50%H100 80GBStrong reasoning and coding
Qwen3-235B-A22B (MoE)~235 GB FP8 / ~470 GB BF16Est. 52%+4x H100 / 2x H200Near-frontier open-weight
DeepSeek-V3 (MoE)~200 GB FP8~50%8x H100 FP8Max performance open-weight
Claude Opus 4.6 (API)N/A68.4%NoneManaged API, top OpenHands benchmark result

For most teams starting with OpenHands, Devstral 24B on a single H100 is the practical default. You get 46.8% SWE-bench Verified at a single-GPU cost, with tool-call support that works correctly with the Mistral function calling parser. Qwen3-32B is worth the additional VRAM if your task mix goes beyond pure coding into reasoning-heavy debugging or cross-language work.

For a broader comparison of open-weight frontier models, see the open-weight frontier model showdown.

For teams that need H200 GPU rental on Spheron to run Qwen3-235B-A22B MoE, you need 2x H200 (282 GB combined HBM3e) to hold the full weight set at FP8, or 4x H100 at FP8. A single H200's 141 GB is not enough, because all expert weights must reside in VRAM even though only 22B parameters are active per forward pass. At spot pricing, 2x H200 is often cheaper per hour than 4x H100 on-demand, which changes the cost math for large MoE models significantly.

GPU Sizing and Pricing

GPUVRAMOn-demand ($/hr)Spot ($/hr)Models supportedBest for
H100 SXM580 GB$4.21$0.80Devstral BF16, Qwen3-32BPrimary inference node
H200 SXM5141 GB$4.54$1.19Qwen3-235B MoE, DeepSeek-V3 FP8Large-model inference
A100 80GB SXM480 GB$1.64$0.45Devstral BF16, Qwen3-32BBudget single-GPU

Pricing fluctuates based on GPU availability. The prices above are based on 09 May 2026 and may have changed. Check current GPU pricing → for live rates.

For single-team deployments running Devstral on H100, the all-in compute cost is $4.21/hr on-demand or $0.80/hr spot. At 30-minute average task duration and one agent session, that's $2.11/task on-demand and $0.40/task on spot before any MIG concurrency gains.

Step-by-Step Deployment

Step 1: Provision the inference node

Log into app.spheron.ai and provision an H100 SXM5 80GB instance. Choose spot pricing for the inference node if your tasks are interruptible. Attach at least 200 GB persistent storage for model weights and vLLM KV cache. For the controller node, a CPU instance with 8-16 vCPU and 32 GB RAM is sufficient.

For on-demand H100 access on Spheron, select the SXM5 variant if MIG partitioning for concurrent agents is part of your plan. MIG is not available on PCIe variants.

Step 2: Install vLLM and download weights

On the inference node:

bash
pip install 'vllm>=0.8.0' huggingface_hub hf_transfer
export HF_TOKEN=your_hf_token
export HF_HUB_ENABLE_HF_TRANSFER=1

# For Devstral 24B
huggingface-cli download mistralai/Devstral-Small-2505

# For Qwen3-32B
huggingface-cli download Qwen/Qwen3-32B

Step 3: Launch vLLM

For Devstral 24B at BF16 on H100:

bash
vllm serve mistralai/Devstral-Small-2505 \
 --dtype bfloat16 \
 --max-model-len 65536 \
 --port 8000 \
 --enable-auto-tool-choice \
 --tool-call-parser mistral

The --tool-call-parser mistral flag is required for Devstral. Omitting it causes malformed function call output that breaks the OpenHands agent loop silently. The agent receives tool outputs but they are unparseable, and you'll see the agent spinning without making progress. For Qwen3 models, use --tool-call-parser hermes instead.

For Qwen3-32B at BF16 on H100:

bash
vllm serve Qwen/Qwen3-32B \
 --dtype bfloat16 \
 --max-model-len 32768 \
 --port 8000 \
 --enable-auto-tool-choice \
 --tool-call-parser hermes

Do not expose port 8000 to the public internet. Use the instance's internal network IP for controller-to-inference communication.

Step 4: Configure and launch OpenHands

On the controller node, create config.toml:

toml
[core]
workspace_base = "/opt/workspace_base"

[llm]
model = "openai/devstral"
base_url = "http://<inference-node-ip>:8000/v1"
api_key = "none"

The openai/ prefix on the model name tells LiteLLM to use the OpenAI-compatible request format. This works regardless of the actual model, as long as your vLLM server speaks the OpenAI API.

Pull and run OpenHands:

bash
docker pull ghcr.io/all-hands-ai/openhands:1.7.0

docker run -d \
 --restart unless-stopped \
 -e SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:1.7.0-nikolaik \
 -e LOG_ALL_EVENTS=true \
 -v /var/run/docker.sock:/var/run/docker.sock \
 -v /your/workspace:/opt/workspace_base \
 -v /path/to/config.toml:/app/config.toml \
 -p 3000:3000 \
 --name openhands-app \
 ghcr.io/all-hands-ai/openhands:1.7.0

Two things to note here: first, the Docker socket mount (/var/run/docker.sock) is required. OpenHands uses it to spawn sandbox containers. Without it, the controller cannot start sandbox containers and the agent loop fails immediately. Second, the SANDBOX_RUNTIME_CONTAINER_IMAGE version must match the openhands image version. Running openhands:1.7.0 with runtime:1.6.0-nikolaik causes a container start failure with a cryptic error. Always use matching version tags.

Open the UI at http://<controller-ip>:3000. Submit a simple task ("add a docstring to function X in file Y") and watch the event log. You should see the controller spawn a sandbox container, send tool call requests to the vLLM endpoint, and iterate until the task completes.

Step 5: Run headless mode for batch tasks

OpenHands 1.7.0 supports a REST API for programmatic task submission. Use the v1 endpoint (the v0 /api/conversations path was removed April 1, 2026):

bash
curl -X POST http://<controller-ip>:3000/api/v1/app-conversations \
 -H "Content-Type: application/json" \
 -d '{
 "initial_message": {
 "content": [
 {"type": "text", "text": "Fix the failing test in tests/test_api.py"}
 ]
 },
 "selected_repository": "your-org/your-repo"
 }'

The response includes app_conversation_id. Poll GET /api/v1/app-conversations/<id> until status is READY, then fetch the result. This is the interface for integrating OpenHands into CI/CD pipelines or batch processing queues.

Scaling Concurrent Agents with MIG

MIG partitioning is available on A100 SXM4, H100 SXM5, and H200 SXM5 (not H100 PCIe, not A100 PCIe, not RTX-series). It splits a single GPU into isolated slices, each with dedicated VRAM and compute. For a deep dive into MIG vs. time-slicing vs. MPS for running multiple models on one GPU, see running multiple LLMs on one GPU.

On an H100 80GB, the 3g.40gb profile creates two slices, each with 40 GB VRAM. Each slice runs one vLLM instance serving one agent session:

bash
# Enable MIG mode
nvidia-smi -i 0 -mig 1

# Create two 3g.40gb slices
nvidia-smi mig -cgi 3g.40gb,3g.40gb -i 0
nvidia-smi mig -cci -i 0

# List MIG instance UUIDs
nvidia-smi -L

# Launch two separate vLLM processes, each on one slice
CUDA_VISIBLE_DEVICES=MIG-<uuid-0> vllm serve mistralai/Devstral-Small-2505 \
 --quantization fp8 --max-model-len 32768 --port 8000 \
 --enable-auto-tool-choice --tool-call-parser mistral &

CUDA_VISIBLE_DEVICES=MIG-<uuid-1> vllm serve mistralai/Devstral-Small-2505 \
 --quantization fp8 --max-model-len 32768 --port 8001 \
 --enable-auto-tool-choice --tool-call-parser mistral &

Run two OpenHands controllers, each configured to use a different vLLM port. Two fully isolated agent sessions, one H100, no interference between sessions.

MIG has one important constraint: it changes the VRAM available to each vLLM instance. A 3g.40gb slice gives 40 GB, which is not enough for Devstral 24B at BF16 (~50 GB). Use FP8 quantization (~26 GB) to fit within the slice. Qwen3-32B at BF16 needs 65 GB and doesn't fit a single MIG slice on H100. For Qwen3-32B concurrency, use NVIDIA MPS instead:

bash
# Start MPS daemon (no MIG required)
nvidia-cuda-mps-control -d

# All vLLM processes share the GPU through MPS
vllm serve Qwen/Qwen3-32B --dtype bfloat16 --max-model-len 32768 --port 8000 ...

MPS multiplexes a single vLLM process across concurrent agent request streams without partitioning VRAM. Throughput is shared, not isolated.

Cost per task with concurrent agents

The formula for per-task cost at MIG concurrency:

cost per task = (avg_task_duration_hours) × (GPU $/hr) / concurrent_agents

Example: 30-minute tasks, H100 on-demand at $4.21/hr, 2 concurrent MIG sessions:

cost per task = 0.5 × $4.21 / 2 = $1.05/task

On spot at $0.80/hr:

cost per task = 0.5 × $0.80 / 2 = $0.20/task

Security

Four things to get right before running autonomous code execution in production:

Sandbox network isolation. Set SANDBOX_NETWORK_DISABLED=true in the OpenHands config. This prevents the sandbox container from making outbound network requests. Useful for tasks that shouldn't pull external packages or exfiltrate code during execution. For tasks that genuinely need network access (installing dependencies, API calls), disable this selectively per-task rather than leaving it open globally.

Secret handling. Never put API keys, database credentials, or GitHub tokens in config.toml. Mount a secrets directory as a read-only Docker volume into the controller container and reference secrets via environment variables. The sandbox container has its own filesystem isolation, but anything mounted into the controller's config is visible to the agent's Python environment.

Repo permissions. Use fine-grained GitHub PATs scoped to the specific repository the agent is working on. Do not give OpenHands an org-wide token or a token with write access to unrelated repos. The agent will use whatever permissions the token provides.

Docker socket access. The OpenHands controller requires /var/run/docker.sock mounted into its container. This is effectively root-equivalent access on the host. Run the controller container with --security-opt no-new-privileges and keep the controller host separate from production systems. The controller node should not have access to production databases or internal services. This is exactly why the two-node split from the architecture section matters: the inference node with the expensive GPU has no Docker socket access and cannot spawn containers.

Cost Comparison

Assuming 30-minute average task duration and 2 concurrent MIG sessions at 1,000+ tasks/month:

Tasks/monthDevin (Team plan, est.)GitHub Copilot WorkspaceOpenHands on H100 on-demandOpenHands on H100 spot
100~$500 (plan minimum)~$19-39/seat~$105~$20
1,000~$2,000-5,000 (per-task overage)~$190-390/seat~$1,050~$200
10,000~$20,000+Not designed for this volume~$10,500~$2,000

Devin's team plan pricing is based on publicly reported figures from early 2026 and includes a fixed task allocation. Overage pricing varies by plan tier. Copilot Workspace pricing is per-seat and not designed for high-volume autonomous task execution. OpenHands costs are calculated from live H100 SXM5 pricing: on-demand $4.21/hr, spot $0.80/hr, 30-min avg task, 2 concurrent MIG sessions.

At 1,000 tasks/month, self-hosted OpenHands on H100 spot costs roughly 10-25x less than Devin at scale, and 2-5x less on on-demand pricing. The crossover point where self-hosting pays off depends on your ops overhead to maintain the GPU instance and the OpenHands stack. For teams already running GPU cloud infrastructure, that overhead is near zero.


OpenHands runs well on Spheron H100 instances, where flat hourly pricing keeps cost predictable as task volume grows - unlike serverless GPU billing that compounds on long agent loops.

H100 pricing on Spheron → | View all GPU pricing → | Get started →

STEPS / 08

Quick Setup Guide

  1. Choose your model and size the inference GPU

    Pick Devstral 24B (single A100 80GB or H100) for coding-specialized tasks at 46.8% SWE-bench Verified. Pick Qwen3-32B at BF16 (single H100 80GB) for stronger general reasoning with coding ability. Pick DeepSeek-V3 or a Qwen3 MoE variant (4-8x H100 or H200) for near-frontier performance. Confirm VRAM requirements from the sizing table in this post before provisioning.

  2. Provision GPU and controller nodes on Spheron

    Log in to app.spheron.ai. Provision one H100 SXM5 80GB instance (or H200 for larger models) as the inference node. Provision one CPU-only or small instance (8-16 vCPU, 32 GB RAM) as the OpenHands controller node. Attach persistent storage (200 GB minimum) to the inference node for model weights and vLLM cache.

  3. Deploy vLLM inference server on the GPU node

    Install vLLM 0.8+ on the inference node. For Devstral 24B BF16: `vllm serve mistralai/Devstral-Small-2505 --dtype bfloat16 --max-model-len 65536 --port 8000 --enable-auto-tool-choice --tool-call-parser mistral`. For Qwen3-32B: `vllm serve Qwen/Qwen3-32B --dtype bfloat16 --max-model-len 32768 --port 8000 --enable-auto-tool-choice --tool-call-parser hermes`. Expose port 8000 on the instance's internal network. Do not expose it to the public internet without authentication.

  4. Install and configure the OpenHands controller

    On the controller node, pull the OpenHands image: `docker pull ghcr.io/all-hands-ai/openhands:1.7.0` (use the latest stable tag). Create a config.toml with LLM settings pointing to your vLLM endpoint: set `model = 'openai/devstral'` (the openai/ prefix tells LiteLLM to use OpenAI-compatible format), `base_url = 'http://<inference-node-ip>:8000/v1'`, and `api_key = 'none'`. Set `workspace_base` to a mounted persistent volume path.

  5. Run OpenHands and verify the agent loop

    Start the OpenHands server: `docker run -d --restart unless-stopped -e SANDBOX_RUNTIME_CONTAINER_IMAGE=ghcr.io/all-hands-ai/runtime:1.7.0-nikolaik -e LOG_ALL_EVENTS=true -v /var/run/docker.sock:/var/run/docker.sock -v /your/workspace:/opt/workspace_base -v /path/to/config.toml:/app/config.toml -p 3000:3000 --name openhands-app ghcr.io/all-hands-ai/openhands:1.7.0`. Open the UI at http://<controller-ip>:3000, submit a task, and confirm the agent spins up a sandbox container and sends requests to the vLLM endpoint. Watch for tool call round-trips in the event log.

  6. Configure MIG or MPS for concurrent agent sessions

    For H100 or H200 SXM instances, enable MIG mode: `nvidia-smi -i 0 -mig 1`. Create MIG slices: `nvidia-smi mig -cgi 3g.40gb,3g.40gb -i 0 && nvidia-smi mig -cci -i 0`. Run two separate vLLM instances, each pinned to one slice via `CUDA_VISIBLE_DEVICES=MIG-<uuid>`. Each vLLM instance serves one concurrent agent session with isolated VRAM. For shared-model concurrent sessions, NVIDIA MPS (`nvidia-cuda-mps-control -d`) multiplexes a single vLLM process across multiple agent request streams without MIG partitioning overhead.

  7. Set up audit logging and secret isolation

    Enable OpenHands event logging with `LOG_ALL_EVENTS=true`. Mount a secrets directory as a read-only volume into the controller container and reference secrets via environment variables, not hardcoded strings. Set `SANDBOX_NETWORK_DISABLED=true` in the sandbox config to block internet access from agent containers. Use Docker's `--security-opt no-new-privileges` and `--read-only` flags on sandbox containers. Review the OpenHands security hardening docs at docs.all-hands.ai for repo permission scoping.

  8. Benchmark resolved-task cost and compare to managed alternatives

    Instrument the total wall-clock time per task from controller start to agent completion signal. Multiply by your GPU hourly rate (from the pricing table in this post). Compare to Devin's per-task subscription pricing and GitHub Copilot Workspace's per-task usage. At 1,000 tasks/month, self-hosted OpenHands on H100 costs roughly $1.05/resolved task on-demand and $0.20 on spot, while Devin's per-task pricing runs several dollars per task at scale.

FAQ / 05

Frequently Asked Questions

The OpenHands controller itself is CPU-only and lightweight - it runs on any Linux instance with Docker. The GPU is for the LLM inference backend (vLLM or SGLang). Devstral 24B at FP8 requires a single A100 80GB or H100. Qwen3-32B needs ~65 GB VRAM and fits on an H100 80GB. Larger models like Qwen3-235B-A22B MoE or DeepSeek-V3 require 4-8x H100 or H200. For most single-team deployments, a single H100 80GB handles Devstral or Qwen3-32B at BF16 with throughput for 3-5 concurrent agent sessions.

Each OpenHands agent session gets its own Docker sandbox container (the runtime). The OpenHands controller manages sandbox lifecycle. For concurrent agents, run one OpenHands controller per host and set SANDBOX_RUNTIME_CONTAINER_IMAGE to your pinned runtime image. GPU sharing across concurrent sessions is handled by NVIDIA MPS on a single GPU or MIG slices on H100/H200 SXM. Each MIG slice (e.g., 3g.40gb on H100) runs one inference process and serves one agent session with isolated VRAM.

Yes. OpenHands uses LiteLLM as its model abstraction layer, so it works with any provider LiteLLM supports: Anthropic Claude, OpenAI, Azure OpenAI, Gemini, and any OpenAI-compatible self-hosted endpoint. Set LLM_MODEL=anthropic/claude-sonnet-4-6 and LLM_API_KEY in the config. Self-hosting the LLM only makes sense if you need data sovereignty, predictable per-task cost, or plan to run high enough task volumes that API costs exceed self-hosted GPU compute.

On SWE-bench Verified (500 tasks), Devin 2.0 scored around 45.8% as of early 2026. OpenHands with Claude Opus 4.6 reaches 68.4% on SWE-bench Verified. With open-weight models, Devstral running via OpenHands achieves around 46.8% on SWE-bench Verified. The key difference is cost: Devin charges per task on a subscription model, while OpenHands on self-hosted GPU costs roughly $0.20-1.05 per resolved task at H100 rates (spot to on-demand with 2 MIG sessions) depending on model and task complexity.

The OpenHands runtime sandbox is a Docker container where the agent's code edits, shell commands, and test runs execute. The sandbox itself does not need GPU access unless your tasks involve GPU workloads (ML training, CUDA code testing). For typical software engineering tasks like writing, editing, and testing code, the sandbox needs only CPU and disk. Only the vLLM/SGLang LLM inference server needs the GPU.

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models-ready when you are.