
URL: https://www.spheron.network/blog/deploy-hrm-gpu-cloud/

Deploy the Hierarchical Reasoning Model (HRM) on GPU Cloud: Self-Host a 27M-Parameter Reasoner (2026) | Spheron Blog


Running a 70B reasoning model on an H100 costs around $3.10/hr on-demand. Running HRM on a single RTX 4090 costs $0.79/hr. On structured reasoning benchmarks, HRM at 27M parameters matches or beats 70B-class chain-of-thought models. That gap only makes sense when you understand what HRM actually does differently.

HRM does not generate reasoning tokens. It reasons in embedding space through a deliberation loop, then emits a final answer. The result: no thinking-token overhead, no KV cache explosion, and a model small enough to fit in the margin of GPU memory you would have spent anyway. For the broader economics of why reasoning model inference is expensive, the reasoning model inference cost optimization guide covers the token-explosion problem in detail.

What Is HRM?

HRM is a 27M-parameter hierarchical recurrent reasoner published in 2025. Its defining property: it solves structured reasoning tasks through internal deliberation rather than chain-of-thought token generation.

Where a standard CoT model emits thousands of thinking tokens that consume GPU memory and compute, HRM runs multiple passes of an executor network over compressed sub-goal representations. The final answer is produced directly, with no intermediate tokens in the output stream.

| Model | Parameters | Weights (FP16 unless noted) | VRAM Needed | Reasoning Tokens | Relative Cost |
|---|---|---|---|---|---|
| HRM | 27M | ~50MB | ~200MB | None (internal) | Baseline |
| DeepSeek R2 7B distill | 7B | ~14GB | ~17GB+ | 2,000-8,000 | 50-100x |
| DeepSeek R1 671B | 671B | ~336GB (4-bit) | 8x H100 | 4,000-12,000 | 10,000x+ |
| Nemotron Ultra 253B | 253B | ~127GB (4-bit) | 4x H100 | 2,000-8,000 | 3,000x+ |

The parameter count matters for GPU economics because VRAM is usually the binding constraint in inference. At 27M parameters, HRM fits in the VRAM overhead that other models waste. You can run hundreds of concurrent HRM instances on a single RTX 4090, while a single 7B model in FP16 (~17GB+) would take most of its 24GB.
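The weight figures in the table follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic is below; the helper names are illustrative, and runtime VRAM adds activations, caches, and the CUDA context on top of raw weights, which is why the VRAM column is larger than the weights column.

```python
# Bytes per parameter for common storage formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_bytes(num_params: float, dtype: str = "fp16") -> float:
    """Return the size of the raw weights in bytes (no activations or caches)."""
    return num_params * BYTES_PER_PARAM[dtype]

def human(n_bytes: float) -> str:
    """Format a byte count using decimal units (1 GB = 1e9 bytes)."""
    for unit, scale in (("GB", 1e9), ("MB", 1e6)):
        if n_bytes >= scale:
            return f"{n_bytes / scale:.0f}{unit}"
    return f"{n_bytes:.0f}B"

print("HRM 27M, FP16:", human(weight_bytes(27e6)))    # 54MB
print("7B model, FP16:", human(weight_bytes(7e9)))    # 14GB
```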

Architecture: Planner, Executor, and the Deliberation Loop

HRM uses a two-level hierarchy: a planner module that decomposes problems and a low-level executor that solves each sub-goal iteratively.

How the data flows:

  1. Input arrives at the planner
  2. Planner decomposes the problem into a sequence of sub-goals (represented as embeddings, not tokens)
  3. Each sub-goal goes to the executor
  4. Executor runs N iterations over the sub-goal using KV-cache reuse across iterations
  5. Executor produces a sub-goal solution when it converges
  6. Planner assembles sub-goal solutions into a final answer
  7. Final answer is emitted as output tokens (short, since there is no CoT trace)

The deliberation loop terminates when the executor converges or hits a configurable max-step limit. Default is 8 steps; harder tasks benefit from 12-16. Critically, no intermediate tokens are sampled during deliberation. All reasoning happens in the model's internal embedding space.
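The termination rule can be illustrated with a toy fixed-point loop. This is not HRM's code: `refine`, `tol`, and the scalar state are stand-ins chosen for illustration. Only the control flow mirrors the description above: iterate on an internal state, stop on convergence or at the step cap, and emit nothing in between.

```python
from typing import Callable, Tuple

def deliberate(
    state: float,
    refine: Callable[[float], float],
    max_steps: int = 8,
    tol: float = 1e-6,
) -> Tuple[float, int]:
    """Refine an internal state until it stops changing or the step cap is hit.

    Returns the final state and the number of refinement steps taken.
    No intermediate output is produced; only the final state is returned.
    """
    for step in range(1, max_steps + 1):
        new_state = refine(state)
        converged = abs(new_state - state) < tol
        state = new_state
        if converged:
            return state, step
    return state, max_steps

# Toy refinement: a Newton step toward sqrt(2), standing in for one executor pass.
answer, steps = deliberate(1.0, lambda x: 0.5 * (x + 2.0 / x))
print(f"answer={answer:.6f} steps={steps}")
```

Raising `max_steps` only matters for inputs that have not converged by the cap, which is the same reason a deeper limit helps harder tasks and costs nothing on easy ones.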

This differs fundamentally from CoT models like DeepSeek R2 or QwQ. Those models sample tokens at every reasoning step, which grows the KV cache linearly with reasoning depth. HRM's KV cache footprint stays nearly constant regardless of deliberation depth because the executor reuses cached representations across iterations rather than extending the sequence.
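To see why sampled reasoning tokens are expensive, consider the standard transformer KV-cache size: 2 (keys and values) × layers × KV heads × head dimension × sequence length × bytes per element. The sketch below uses hypothetical 7B-class dimensions (32 layers, 32 KV heads, head dimension 128, FP16) purely for illustration; they are not taken from any specific model card.

```python
def kv_cache_bytes(seq_len, layers=32, kv_heads=32, head_dim=128, bytes_per_elem=2):
    """KV-cache size in bytes for one sequence of a decoder-only transformer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

prompt_len = 512  # assumed prompt length
for thinking_tokens in (0, 2_000, 8_000):
    gib = kv_cache_bytes(prompt_len + thinking_tokens) / 2**30
    print(f"{thinking_tokens:>5} thinking tokens -> {gib:.2f} GiB per request")
```

With these assumed dimensions each token costs 0.5 MiB of cache, so 8,000 thinking tokens add roughly 3.9 GiB per request. A model that deliberates without extending the sequence stays at the prompt-length figure regardless of depth.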

For the theoretical background on compute vs token tradeoffs, see inference-time compute scaling.

GPU Sizing for HRM

Single-Instance: RTX 4090

HRM needs under 200MB of VRAM in FP16, of which about 50MB is weights. A single RTX 4090 has 24GB of VRAM, which leaves roughly 23.8GB for batch queues and executor KV-cache. That headroom handles hundreds of concurrent HRM instances even at aggressive batch sizes.

  • On-demand price: $0.79/hr
  • Batch size: up to 512 concurrent requests at default deliberation depth
  • Best for: development, low-to-medium concurrency production, cost-sensitive pipelines

Run HRM on RTX 4090 for any workload processing under 50,000 queries/hour. Above that, move to A100 for the larger memory headroom and more memory bandwidth.

Batched Serving: A100 80G

The A100 80G SXM4 has 80GB of HBM2e and significantly higher memory bandwidth than the RTX 4090. At production batch sizes (64+), the bandwidth advantage translates to better throughput per dollar.

  • On-demand price: $1.64/hr
  • Spot price: $0.45/hr
  • Best for: production APIs, multi-tenant serving, embedding HRM in a larger inference router

HRM's executor iterations are bandwidth-bound at small batch sizes and compute-bound at large ones. The A100's 2 TB/s bandwidth handles the transition well. For interruptible workloads (batch jobs, async queues), A100 spot at $0.45/hr brings HRM to roughly $0.006-0.02 per 1K queries at the throughput estimated in the cost analysis below.
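The bandwidth-bound to compute-bound transition can be estimated with a simple roofline calculation. For a weight-dominated FP16 matrix multiply, each 2-byte weight read supports 2 FLOPs per batch element, so arithmetic intensity is roughly equal to the batch size, and the crossover batch is peak FLOP/s divided by memory bandwidth. The 312 TFLOP/s figure is the A100's published dense FP16 tensor-core peak. Treating HRM's executor as a weight-dominated matmul is a simplifying assumption, so read the result as an order-of-magnitude guide, not a measurement.

```python
def crossover_batch(peak_flops: float, bandwidth_bytes_per_s: float,
                    bytes_per_weight: float = 2.0) -> float:
    """Batch size at which a weight-dominated matmul shifts from
    bandwidth-bound to compute-bound (simple roofline estimate)."""
    # FLOPs per weight per batch element: one multiply plus one add.
    flops_per_weight_per_sample = 2.0
    # Machine balance: how many FLOPs the GPU can do per byte it reads.
    machine_balance = peak_flops / bandwidth_bytes_per_s
    return machine_balance * bytes_per_weight / flops_per_weight_per_sample

a100 = crossover_batch(peak_flops=312e12, bandwidth_bytes_per_s=2.0e12)
print(f"A100 crossover batch (estimate): {a100:.0f}")
```

Real kernels rarely reach peak on either axis, so the practical crossover is best found by sweeping batch size on your own workload.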

When to Escalate to H100

HRM itself never needs an H100. The H100 is for your fallback model when HRM cannot handle a query. The architecture here is a two-tier stack: HRM on RTX 4090 for structured reasoning, H100 with DeepSeek R2 for everything else.

See the LLM inference router guide for how to build the routing layer between tiers.
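A minimal sketch of the two-tier decision, assuming each request carries a `task_type` label set by the caller. The tier names, the `task_type` field, and the label strings are hypothetical; the grounded part is that HRM serves only the task families its published checkpoints cover (ARC-AGI-2, Sudoku, Maze), and everything else goes to the fallback model.

```python
from dataclasses import dataclass

# Task families covered by the published HRM checkpoints (labels are illustrative).
HRM_TASK_FAMILIES = {"arc", "sudoku", "maze"}

@dataclass
class Request:
    task_type: str   # hypothetical label supplied by the caller
    payload: str

def choose_tier(req: Request) -> str:
    """Send in-distribution structured tasks to HRM, everything else to the fallback."""
    if req.task_type.lower() in HRM_TASK_FAMILIES:
        return "hrm-rtx4090"
    return "fallback-h100"

print(choose_tier(Request("sudoku", "...")))     # hrm-rtx4090
print(choose_tier(Request("code-gen", "...")))   # fallback-h100
```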

| GPU | VRAM | HRM Instances (FP16, est.) | On-Demand Price/hr | Best Use |
|---|---|---|---|---|
| RTX 4090 | 24GB | ~400 | $0.79 | Dev / low concurrency |
| A100 80G SXM4 | 80GB | ~1,000+ | $1.64 | Production serving |
| H100 SXM5 | 80GB | N/A (overkill) | $3.10 | Fallback model only |

Pricing fluctuates based on GPU availability. The prices above are as of 02 May 2026 and may have changed. Check current GPU pricing for live rates.

Prerequisites

  • Spheron account with GPU access (app.spheron.ai)
  • PyTorch 2.6 + CUDA 12.4
  • Python 3.11+
  • Hugging Face account (for model weights download)
  • 8GB+ disk space for weights and cache
  • Ray 2.9+ (for Ray Serve wrapper)

Step-by-Step Deployment

1. Provision a GPU Instance

Rent an RTX 4090 on Spheron. For detailed provisioning and SSH connection steps, see the Spheron getting-started docs. For batch workloads, use spot pricing on the A100 for the lowest cost. See GPU pricing for current rates.

bash
# After SSH into your Spheron instance, verify the GPU
nvidia-smi
# Expected: NVIDIA GeForce RTX 4090 or NVIDIA A100-SXM4-80GB

2. Install Dependencies

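bash
pip install torch==2.6.0+cu124 torchvision --index-url https://download.pytorch.org/whl/cu124
pip install "ray[serve]" huggingface_hub
git clone https://github.com/sapientinc/HRM hrm
cd hrm && pip install -e .
# If the repo ships no installable package metadata, install its pinned
# dependencies instead: pip install -r requirements.txt

python
# Confirm that PyTorch can see the GPU
import torch
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 4090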

3. Download Model Weights

python
from huggingface_hub import snapshot_download

# HRM ships three task-specific checkpoints, not one general model:
# - ARC-AGI-2: sapientinc/HRM-checkpoint-ARC-2
# - Sudoku 9x9 Extreme: sapientinc/HRM-checkpoint-sudoku-extreme
# - Maze 30x30 Hard: sapientinc/HRM-checkpoint-maze-30x30-hard
# Download the ARC-AGI-2 checkpoint for ARC-style reasoning tasks:
snapshot_download(
 repo_id="sapientinc/HRM-checkpoint-ARC-2",
 local_dir="./hrm-weights",
 revision="main"
)

FP32 checkpoint: ~100MB. FP16 checkpoint: ~50MB. Always verify the SHA checksum after download.
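A minimal sketch of checksum verification using only the standard library. The expected digest must come from a source you trust, such as the file metadata shown on the checkpoint's Hugging Face page. The function names are illustrative, and no digest is given here because none appears in this guide.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large files never load fully into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, expected_hex: str) -> bool:
    """Return True when the file's digest matches the expected value."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Call `verify(Path(<downloaded file>), <expected digest>)` for each file under `./hrm-weights` and abort the deployment if any call returns False.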

HRM is not a single general model. The official repo ships three separate checkpoints, each trained from scratch on approximately 1,000 task-specific examples: ARC-AGI-2, Sudoku 9x9 Extreme, and Maze 30x30 Hard. Each checkpoint solves problems within its own task family. Downloading the ARC-AGI-2 checkpoint gives you a model for ARC-style visual reasoning patterns, not a drop-in for arbitrary constraint satisfaction or logic puzzles outside its training distribution. Pick the checkpoint that matches your target task before building your serving stack.

4. Configure the Inference Loop

python
# pseudocode: verify class names against https://github.com/sapientinc/HRM before running
import torch
from hrm import HRMConfig, HRMInference

config = HRMConfig(
 model_path="./hrm-weights",
 max_deliberation_steps=8, # increase to 16 for harder tasks
 executor_kv_cache_reuse=True,
 seed=42,
 device="cuda:0",
 dtype=torch.float16,
)

Set max_deliberation_steps based on task difficulty. ARC-AGI tasks typically converge in 4-6 steps. Constraint satisfaction with many variables may need 12-16.

5. Serve with FastAPI + Ray Serve

HRM is a custom recurrent architecture. vLLM cannot serve it because it is not a transformer language model. The deliberation loop runs natively with PyTorch. The code below shows the serving structure with pseudocode for the HRM-specific parts. Verify the actual checkpoint loading API against evaluate.py in the official repo before deploying.

python
# pseudocode: verify HRM checkpoint loading against evaluate.py in the official repo
# Usage reference: torchrun --nproc-per-node 1 evaluate.py checkpoint=./hrm-weights

import ray
from ray import serve
from fastapi import FastAPI
import torch

app = FastAPI()

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class HRMServer:
    def __init__(self):
        # pseudocode: replace with actual checkpoint loading
        # checkpoint = torch.load("./hrm-weights/checkpoint.pt", map_location="cuda")
        # self.model = build_hrm_model(checkpoint)
        # self.model.eval()
        self.model = None

    @app.post("/infer")
    async def infer(self, request: dict):
        # pseudocode deliberation loop, implement using the actual model forward pass:
        # with torch.no_grad():
        #     result = self.model.forward(request["prompt"], max_steps=8)
        # return {"output": result.answer, "depth": result.steps_taken}
        return {"output": None, "depth": None}  # placeholder until the model is wired in

ray.init()
serve.run(HRMServer.bind())
# Note: the deployment lives only as long as this Ray instance. Keep the driver
# process running, or launch the app with the `serve run` CLI instead.

bash
# Test the endpoint
curl -X POST http://localhost:8000/infer \
 -H "Content-Type: application/json" \
 -d '{"prompt": "Solve: [example structured reasoning task]"}'
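The same request can be issued from Python with the standard library. This sketch assumes the server above is listening on localhost:8000 and accepts a JSON body with a `prompt` field, as in the curl call. The helper names are illustrative, and the response shape depends on how you implement `infer`.

```python
import json
import urllib.request

def build_infer_request(prompt: str, base_url: str = "http://localhost:8000"):
    """Build a POST request for the /infer endpoint without sending it."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/infer",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def infer(prompt: str, timeout: float = 30.0):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_infer_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `infer("Solve: [example structured reasoning task]")` returns the decoded response body.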

6. Enable torch.compile for Throughput

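python
# pseudocode: HRMConfig and HRMInference are the placeholder names from step 4;
# substitute the model object you actually build from the official repo.
import torch
from hrm import HRMConfig, HRMInference

config = HRMConfig(
    model_path="./hrm-weights",
    max_deliberation_steps=8,
    device="cuda:0",
    dtype=torch.float16,
)
# torch.compile accepts any nn.Module; "reduce-overhead" uses CUDA graphs
# to cut per-call launch overhead.
model = torch.compile(HRMInference(config), mode="reduce-overhead")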

reduce-overhead mode improves executor iteration throughput by 15-25% on RTX 4090. The first call incurs compilation overhead. Warm up the model with 5-10 requests before benchmarking.
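A small timing helper makes the warm-up step explicit. This is a generic sketch, not part of the HRM repo: `benchmark` and its defaults are illustrative, and the lambda is a stand-in workload to be replaced with a call into the compiled model.

```python
import time
from typing import Callable

def benchmark(fn: Callable[[], object], warmup: int = 10, iters: int = 50) -> float:
    """Return mean seconds per call, excluding warm-up calls.

    Warm-up calls absorb one-time costs such as torch.compile tracing.
    For CUDA work, `fn` should synchronize (e.g. call torch.cuda.synchronize())
    before returning so the timer measures completed GPU work.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Stand-in workload; replace with a call into the compiled model.
mean_s = benchmark(lambda: sum(range(10_000)))
print(f"mean latency: {mean_s * 1e3:.3f} ms")
```

Run it once on the eager model and once on the compiled one, with identical inputs, to measure the speedup on your own hardware.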

Benchmarks

HRM's advantage is specific to tasks with a finite, verifiable answer space. On open-ended tasks, it loses badly.

| Model | ARC-AGI-1 Accuracy | Throughput (queries/sec, RTX 4090) | Cost per 1,000 correct answers |
|---|---|---|---|
| HRM (27M, depth 8) | 40.3% | 50-200+ | ~$0.01-0.05 |
| o3-mini-high | 34.5% | API only | API pricing |
| Claude 3.7 8K | 21.2% | API only | API pricing |

Accuracy figures are the ARC-AGI-1 results from the HRM paper (arXiv 2506.21734), which reports ARC-AGI-2 separately. HRM also achieves near-perfect accuracy on Sudoku 9x9 Extreme and Maze 30x30 Hard, the other two benchmarks reported in the paper.

HRM throughput varies significantly with deliberation depth and batch size. The ranges above reflect default depth 8 at batch sizes 32-256 on a single RTX 4090. Actual numbers depend on task complexity and your hardware.


The cost-per-correct-answer advantage is large for structured tasks. On ARC-AGI, HRM at 40.3% accuracy (self-hosted for ~$0.01-0.05 per 1K correct answers) beats o3-mini-high (34.5%) and Claude 3.7 8K (21.2%) while running on a single RTX 4090 at a fraction of API pricing.

When HRM Wins

HRM is the right call when:

  • Tasks fall within HRM's trained distributions: ARC-AGI-2, Sudoku 9x9 Extreme, or Maze 30x30 Hard patterns. Each checkpoint is task-specific and will not generalize to reasoning tasks outside its training distribution.
  • Answer space is closed: the output is one of N options or satisfies explicit constraints
  • Workload is high-volume and structured: thousands of constraint satisfaction or logic puzzles per hour
  • You need deterministic, reproducible reasoning outputs (fixed seed + deliberation loop)
  • Budget matters: cost-per-correct-answer is the primary metric
  • Low latency matters: no thinking-token generation means fast time-to-first-token

The deliberation loop finds solutions without generating intermediate tokens. On closed-answer-space tasks within its training distribution, that is exactly the right trade-off.

When HRM Loses

Escalate to a larger model when:

  • Open-ended generation: use DeepSeek R1 671B or a frontier model
  • Code synthesis: HRM has no training signal for code generation; use DeepSeek R2 or Qwen-Coder
  • Long-context summarization (>8K tokens): HRM has a limited context window
  • Broad world knowledge: 27M parameters cannot store enough factual knowledge for general QA
  • Unstructured reasoning: free-form reasoning chains require CoT models

For escalation routing, see the DeepSeek R2 deployment guide. For a comparison of open-weight frontier models as fallback options, see the open-weight frontier model showdown.

Production Checklist

  1. Deliberation depth monitoring - log deliberation_depth per request. Alert if P95 depth exceeds 12. That signals tasks beyond the model's capability, and those should route to a larger model.
  2. Early-exit heuristics - terminate the executor loop when confidence score exceeds 0.95. This avoids wasting compute on easy tasks that converge quickly.
  3. Fallback routing - when HRM entropy exceeds threshold, route to DeepSeek R2 7B distill or larger. The LLM inference router guide covers the routing layer in full.
  4. KV-cache sizing - allocate at least 4GB KV-cache per GPU for executor iterations across concurrent requests.
  5. Batch size tuning - HRM benefits from large batch sizes. Target batch size 64+ on A100 80G.
  6. Spot vs on-demand - HRM inference is stateless per request. Spot instances are safe for the HRM tier. Reserve on-demand for the fallback model only.
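Items 1 and 3 of the checklist can be sketched with the standard library. The class and function names, the 1,000-request window, and the entropy threshold of 1.0 nats are all assumptions for illustration; how you obtain per-request depth and an output probability distribution depends on your model wrapper.

```python
import math
from collections import deque
from typing import Sequence

class DepthMonitor:
    """Track recent deliberation depths and flag when P95 exceeds a limit."""

    def __init__(self, window: int = 1000, p95_limit: int = 12):
        self.depths = deque(maxlen=window)  # sliding window of recent depths
        self.p95_limit = p95_limit

    def record(self, depth: int) -> None:
        self.depths.append(depth)

    def p95(self) -> float:
        """Nearest-rank 95th percentile of the current window."""
        if not self.depths:
            return 0.0
        ordered = sorted(self.depths)
        rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
        return float(ordered[rank - 1])

    def should_alert(self) -> bool:
        return self.p95() > self.p95_limit

def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_fallback(probs: Sequence[float], threshold: float = 1.0) -> bool:
    """Route to the larger model when the output distribution is too uncertain."""
    return entropy(probs) > threshold
```

Call `record` once per request and check `should_alert` on a timer. Tune the entropy threshold on held-out tasks so that fallbacks track actual HRM errors.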

Cost Analysis

| Setup | GPU | Price/hr | Queries/hr (est.) | Cost per 1K queries |
|---|---|---|---|---|
| HRM single | RTX 4090 | $0.79 | ~5,000-20,000 | ~$0.04-0.16 |
| HRM batched | A100 80G SXM4 | $1.64 (on-demand) | ~20,000-80,000 | ~$0.02-0.08 |
| HRM batched (spot) | A100 80G SXM4 | $0.45 (spot) | ~20,000-80,000 | ~$0.006-0.02 |
| DeepSeek R2 7B distill | A100 80G SXM4 | $1.64 | ~500-2,000 | ~$0.82-3.28 |
| DeepSeek R1 671B | 8x H100 SXM5 | ~$24.80 | ~100-400 | ~$62-248 |

Throughput estimates for HRM vary by deliberation depth and batch size. Figures above assume depth 8, batch size 64-256. For deeper cost optimization techniques, see the reasoning model inference cost guide.
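The last column of the table is hourly price divided by hourly throughput, scaled to 1,000 queries. The sketch below reproduces it from the table's own price and throughput estimates, so you can substitute live prices or measured throughput; the function name is illustrative.

```python
def cost_per_1k_queries(price_per_hour: float, queries_per_hour: float) -> float:
    """Dollars spent per 1,000 queries at a given hourly price and throughput."""
    return price_per_hour / queries_per_hour * 1000

# (setup, price per hour, low throughput estimate, high throughput estimate)
rows = [
    ("HRM single, RTX 4090",     0.79,  5_000, 20_000),
    ("HRM batched, A100",        1.64, 20_000, 80_000),
    ("HRM batched, A100 spot",   0.45, 20_000, 80_000),
    ("DeepSeek R2 7B distill",   1.64,    500,  2_000),
    ("DeepSeek R1 671B, 8xH100", 24.80,   100,    400),
]
for name, price, low_qph, high_qph in rows:
    best = cost_per_1k_queries(price, high_qph)   # cheapest case: highest throughput
    worst = cost_per_1k_queries(price, low_qph)   # costliest case: lowest throughput
    print(f"{name:<26} ${best:.3f} - ${worst:.3f} per 1K queries")
```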



HRM shows that, for narrow structured tasks, reasoning quality does not require scale. A 27M-parameter model on a single RTX 4090 handles structured reasoning tasks that would otherwise demand a multi-GPU H100 setup. Spheron's on-demand RTX 4090 and A100 instances are the cheapest place to run the small-GPU side of a two-tier reasoning stack.



Quick Setup Guide

  1. Provision a GPU instance

    Rent an RTX 4090 (24GB VRAM) or A100 80G (80GB VRAM) on Spheron. For single-user or low-concurrency inference, RTX 4090 is sufficient and cheapest. For batched production serving with 16+ concurrent requests, use A100 80G.

  2. Install PyTorch and dependencies

    Install PyTorch 2.6 with CUDA 12.4 support, then clone and install the HRM library from the official GitHub repository. vLLM is not required for HRM inference. Verify GPU availability with torch.cuda.is_available().

  3. Download HRM model weights

    Pull HRM weights from Hugging Face using huggingface_hub or git-lfs. The full FP32 checkpoint is approximately 100MB; the recommended FP16 checkpoint is approximately 50MB. Verify the SHA checksum after download.

  4. Configure the inference loop

    Set a fixed random seed for deterministic outputs, configure max deliberation steps (default 8, increase to 16 for harder tasks), and enable KV-cache reuse for the executor module across deliberation iterations.

  5. Launch the inference server

    Load the HRM checkpoint with PyTorch and run the deliberation loop natively in your application. Wrap the inference function with FastAPI or Ray Serve for production serving.

  6. Configure production monitoring

    Track average deliberation depth per request, set an early-exit threshold when executor confidence exceeds 0.95, and configure fallback routing to DeepSeek R2 or a larger model when HRM's output entropy indicates low confidence.


Frequently Asked Questions

What GPU do I need to run HRM?

HRM has 27 million parameters and fits within 200MB of VRAM in FP16. A single RTX 4090 (24GB) is far more than enough. You can serve hundreds of concurrent HRM instances on a single RTX 4090 while leaving the rest of VRAM for batching overhead and KV cache.

How does HRM compare to DeepSeek R1?

On structured reasoning benchmarks like ARC-AGI and grid-constraint tasks, HRM achieves competitive accuracy at a fraction of the cost per query. HRM scored 40.3% on ARC-AGI, beating o3-mini-high (34.5%) with 27M parameters. DeepSeek R1 leads on open-ended generation, long-context summarization, and code synthesis. For batched structured reasoning jobs, HRM can deliver 20-50x better cost-per-correct-answer.

Can I serve HRM with vLLM?

No. HRM is a custom recurrent architecture, not a transformer language model, so vLLM cannot serve it directly. Inference loads the HRM checkpoint with PyTorch and runs the deliberation loop natively inside your application. You can wrap that in FastAPI or Ray Serve for production serving.

How does the deliberation loop work?

The planner module decomposes a problem into sub-goals, then passes each sub-goal to the executor. The executor iterates multiple times per sub-goal using KV-cache reuse across iterations. The loop exits when the executor converges or hits a configurable max-step limit, producing a final answer without emitting intermediate reasoning tokens.

When should I use HRM instead of a larger model?

Use HRM for structured reasoning with a defined answer space: constraint satisfaction, ARC-style visual reasoning analogues, multi-step logic puzzles, and classification tasks that benefit from iterative refinement. Escalate to DeepSeek R2 or a frontier model when the task requires open-ended generation, broad world knowledge, or code synthesis.
