Try Crusoe Serverless Fine-Tuning in private preview.

Crusoe Managed Inference

Breakthrough inference speed is here

Achieve up to 9.9x faster time-to-first-token*

Process up to 5x more tokens per second*

Optimal price-performance. No limits.

Run model inference with fast time-to-first-token, low latency, limitless throughput, and resilient scaling.

Eliminate latency with Crusoe's MemoryAlloy technology.

Scale to more users while maintaining consistent low latency.

Reduce token spend and serve more users without hitting capacity limits.

* Benchmarked against vLLM for the Llama-3.3-70B model. Read our blog to learn more details.

Crusoe's inference engine is powered by MemoryAlloy™ technology, a unique cluster-native memory fabric that enables persistent sessions and intelligent request routing.

Model catalog

Experiment with top open/open-source models or work with our team to optimize performance for your own fine-tuned model.

Prices are per 1M tokens.

Model                                   Input       Output      Cached      Context length
Nemotron 3 Ultra                        $1.00       $3.20       $0.25       262,144
DeepSeek V3 0324                        $0.50       $1.50       $0.25       163,840
DeepSeek V4 Flash                       $0.14       $0.28       $0.03       1,048,576
DeepSeek V4 Pro                         $1.74       $3.48       $0.15       1,048,576
Gemma-4-31B-it                          $0.14       $0.40       $0.14       262,144
GLM 5.1                                 $1.20       $4.40       $0.25       202,752
gpt-oss-120b                            $0.05       $0.20       $0.05       131,072
Llama 3.3 70B Instruct                  $0.25       $0.75       $0.13       131,072
Nemotron-3-Nano-30B-A3B-FP8             $0.05       $0.20       $0.03       261,144
Nemotron 3 VoiceChat                    not listed  not listed  not listed  131,072
Nemotron-3-Nano-Omni-30B-A3B-Reasoning  $0.30       $1.83       $0.30       not listed
Nemotron-3-Super-120B-A12B-FP8          $0.30       $2.40       $0.15       261,144
Qwen3 235B A22B Instruct 2507           $0.22       $0.80       $0.11       262,144
Yutori n1.5                             $1.50       $5.00       $1.50       128k
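To make the per-token prices in the catalog above concrete, here is a minimal cost-estimate sketch in Python. The prices are copied from the Llama 3.3 70B Instruct entry; the function name, the request sizes, and the billing rule (cached tokens are treated as the part of the input billed at the cached rate instead of the input rate) are assumptions for illustration, since this page does not state how cached tokens are billed.

```python
from decimal import Decimal

# Prices per 1M tokens, copied from the catalog entry for Llama 3.3 70B Instruct.
LLAMA_3_3_70B = {
    "input": Decimal("0.25"),
    "output": Decimal("0.75"),
    "cached": Decimal("0.13"),
}

TOKENS_PER_PRICE_UNIT = Decimal(1_000_000)


def estimate_cost(prices, input_tokens, output_tokens, cached_tokens=0):
    """Estimate the cost of one request in dollars.

    Assumption (not stated on the page): cached_tokens is the portion of
    input_tokens served from cache and is billed at the cached rate.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * prices["input"]
        + cached_tokens * prices["cached"]
        + output_tokens * prices["output"]
    )
    return total / TOKENS_PER_PRICE_UNIT


# Hypothetical request: 2,000 prompt tokens and 500 generated tokens.
single_request = estimate_cost(LLAMA_3_3_70B, input_tokens=2_000, output_tokens=500)
print(f"Estimated cost per request: ${single_request}")
```

Using `Decimal` rather than floats keeps the small per-request amounts exact when they are summed over many requests.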

Bring your own fine-tuned model

Contact sales

Nemotron-3-Nano-Omni-30B-A3B Reasoning

Input price / 1M tokens: $0.30 (text, image, video); $0.50 (audio)
Output price / 1M tokens: $1.83
Cached price / 1M tokens: $0.30 (text, image, video); $0.50 (audio)
Context length: 256,000

Built with cutting-edge technology to deliver unmatched performance

Breakthrough speed

Achieve up to 9.9x faster time-to-first-token* for real-world workloads with our inference engine featuring Crusoe's MemoryAlloy technology, a cluster-wide KV cache that eliminates duplicate prefills.
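The idea of avoiding duplicate prefills can be illustrated with a toy prefix cache. This is not Crusoe's MemoryAlloy implementation, whose internals this page does not describe; it is a single-process sketch, with a hypothetical class name and made-up prompts, that only counts how many prompt tokens would still need prefill when requests share a prefix.

```python
class ToyPrefixCache:
    """Counts the tokens that still need prefill when prompts share prefixes.

    Illustrative only: a real engine stores KV tensors per token block and
    shares them across nodes; this sketch just remembers which token
    prefixes have been seen (quadratic memory, fine for a demo).
    """

    def __init__(self):
        self._seen_prefixes = set()

    def prefill(self, tokens):
        """Return how many tokens must be newly computed for this prompt."""
        tokens = tuple(tokens)
        # Find the longest prefix of this prompt that is already cached.
        hit = 0
        for n in range(len(tokens), 0, -1):
            if tokens[:n] in self._seen_prefixes:
                hit = n
                break
        # Record the newly computed prefixes so later prompts can reuse them.
        for n in range(hit + 1, len(tokens) + 1):
            self._seen_prefixes.add(tokens[:n])
        return len(tokens) - hit


cache = ToyPrefixCache()
system_prompt = ["You", "are", "a", "helpful", "assistant", "."]

first = cache.prefill(system_prompt + ["Hi"])    # nothing cached yet
second = cache.prefill(system_prompt + ["Bye"])  # shared system prompt is reused
print(first, second)
```

In this sketch the second request only computes the one token that differs, which is the effect a shared KV cache aims for when many requests begin with the same system prompt or conversation history.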

Superior throughput

Process up to 5x more tokens per second* while maintaining low latency for each user with speculative decoding and dynamic batching.
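As a rough illustration of the dynamic batching mentioned above, the sketch below packs queued requests into batches under a token budget, in arrival order. It shows one simple greedy policy, not the scheduler the inference engine actually uses; the function name, the budget, and the request sizes are hypothetical.

```python
def form_batches(request_lengths, max_batch_tokens):
    """Greedily group requests (given as token counts) into batches.

    Requests are kept in arrival order; a new batch starts whenever adding
    the next request would exceed the token budget.
    """
    batches = []
    current = []
    used = 0
    for length in request_lengths:
        if length > max_batch_tokens:
            raise ValueError("a single request exceeds the batch token budget")
        if used + length > max_batch_tokens:
            # Budget would overflow: close the current batch and start a new one.
            batches.append(current)
            current = []
            used = 0
        current.append(length)
        used += length
    if current:
        batches.append(current)
    return batches


# Hypothetical queue of five requests and a 1,000-token budget per batch.
batches = form_batches([300, 500, 400, 900, 100], max_batch_tokens=1000)
print(batches)
```

Larger batches raise throughput because more requests share each model step, while the budget bounds how long any one request waits, which is the trade-off between throughput and per-user latency that the paragraph above refers to.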

Seamless scaling

Meet changing workload demands with scaling that is managed for you, and reliable even when loading the largest models.

* Benchmarked against vLLM for the Llama-3.3-70B model. Read our blog to learn more details.

Crusoe inference engine vs vLLM

Improvement vs. vLLM (Llama-3.3-70B model, 4-node deployment): TTFT 9.9x; throughput 5.0x

"Optimizing for throughput and price is critical for our product experience. We're excited to explore the performance benefits that Crusoe's Inference Engine provides, and are looking forward to serving our models through the service."
Dhruv Batra, Co-founder & Chief Scientist

"This is the kind of foundational technology that will enable our customers to build and deploy far more powerful and responsive AI agents with confidence."
Roey Lalazar, Co-founder & CTO

"We need to process complex records instantly. Crusoe Managed Inference helps us meet that challenge. It provides a reliable path to production at a pace we haven't seen on other platforms."
Grant Jensen, Co-Founder & CEO

Crusoe Intelligence Foundry, designed for AI developers

Speed up app development with a unified hub that accelerates model discovery and experimentation, supports quick iteration, and removes the burden of managing infrastructure.

API keys for fastest path to production

Experiment with top open-source models rapidly. Generate API keys, monitor performance metrics and enable provisioned throughput for production-scale deployments.

Managed endpoints for rapid deployment

Leverage fully managed endpoints powered by our inference engine, with Crusoe's MemoryAlloy technology, tuned specifically to each model for optimized performance.

Unified interface for cross-team collaboration

Users working across teams can easily switch between the Crusoe Intelligence Foundry for inference tasks and the Crusoe Cloud Console for infrastructure-as-a-service (IaaS) resources within a single, integrated environment.

Frequently asked questions

URL: https://www.crusoe.ai/cloud/managed-inference

⇱ Managed inference for open models | Low latency + throughput | Crusoe

Are you ready to build something amazing?