VOOZH about

URL: https://www.crusoe.ai/cloud/managed-inference

⇱ Managed inference for open models | Low latency + throughput | Crusoe


Sign up
Crusoe Managed Inference

Breakthrough inference
speed is here

Achieve up to
9.9x faster time-to-first-token*

Process up to 5x more tokens per second*

Optimal price-performance.
No limits.

Run model inference with fast time-
to-first-token, low latency, limitless throughput, and resilient scaling.

Eliminate latency with Crusoe's MemoryAlloy technology.

Scale to more users while maintaining consistent low latency.

Reduce token spend and serve more users without hitting capacity limits.

*
Benchmarked against vLLM for Llama-3.3-70B model.
‍Read our blog to learn more details.

Crusoe's inference engine is powered by MemoryAlloyTM technology, a unique cluster-native memory fabric that enables persistent sessions and intelligent request routing.

Model catalog

Experiment with top open/open-source models or work with our team to optimize performance for your own fine-tuned model.
πŸ‘ Image
Nemotron 3 Ultra
Input price

$1.00

 / 1M tokens
Output price

$3.20

 / 1M tokens
 / video sec
Cached token price

$0.25

 / 1M tokens
Context length
262,144
πŸ‘ Image
DeepSeek V3 0324
Input price

$0.50

 / 1M tokens
Output price

$1.50

 / 1M tokens
 / video sec
Cached token price

$0.25

 / 1M tokens
Context length
163,840
πŸ‘ Image
DeepSeek V4 Flash
Input price

$0.14

 / 1M tokens
Output price

$0.28

 / 1M tokens
 / video sec
Cached token price

$0.03

 / 1M tokens
Context length
1,048,576
πŸ‘ Image
DeepSeek V4 Pro
Input price

$1.74

 / 1M tokens
Output price

$3.48

 / 1M tokens
 / video sec
Cached token price

$0.15

 / 1M tokens
Context length
1,048,576
πŸ‘ Image
Gemma-4- 31B-it
Input price

$0.14

 / 1M tokens
Output price

$0.40

 / 1M tokens
 / video sec
Cached token price

$0.14

 / 1M tokens
Context length
262,144
Input price

$1.20

 / 1M tokens
Output price

$4.40

 / 1M tokens
 / video sec
Cached token price

$0.25

 / 1M tokens
Context length
202,752
πŸ‘ Image
gpt-oss- 120b
Input price

$0.05

 / 1M tokens
Output price

$0.20

 / 1M tokens
 / video sec
Cached token price

$0.05

 / 1M tokens
Context length
131,072
πŸ‘ Image
Llama 3.3 70B Instruct
Input price

$0.25

 / 1M tokens
Output price

$0.75

 / 1M tokens
 / video sec
Cached token price

$0.13

 / 1M tokens
Context length
131,072
πŸ‘ Image
Nemotron-3-Nano- 30B-A3B-FP8
Input price

$0.05

 / 1M tokens
Output price

$0.20

 / 1M tokens
 / video sec
Cached token price

$0.03

 / 1M tokens
Context length
261,144
πŸ‘ Image
Nemotron 3 VoiceChat
Cached token price
Context length
131,072
πŸ‘ Image
Nemotron-3-Nano-Omni- 30B-A3B-Reasoning
Input price

$0.30

 / 1M tokens
Output price

$1.83

 / 1M tokens
 / video sec
Cached token price

$0.30

 / 1M tokens
πŸ‘ Image
Nemotron-3-Super- 120B-A12B-FP8
Input price

$0.30

 / 1M tokens
Output price

$2.40

 / 1M tokens
 / video sec
Cached token price

$0.15

 / 1M tokens
Context length
261,144
πŸ‘ Image
Qwen3 235B A22B Instruct 2507
Input price

$0.22

 / 1M tokens
Output price

$0.80

 / 1M tokens
 / video sec
Cached token price

$0.11

 / 1M tokens
Context length
262,144
πŸ‘ Image
Yutori n1.5
Input price

$1.50

 / 1M tokens
Output price

$5.00

 / 1M tokens
 / video sec
Cached token price

$1.50

 / 1M tokens
Context length
128k

Bring your own
fine-tuned model

πŸ‘ Image

Nemotron-3-Nano-Omni-30B-
A3B Reasoning

Input price / 1M tokens

$0.30 (text, image, video)
$0.50 (audio)

Output price / 1M tokens

$1.83

Cached price / 1M tokens

$0.30 (text, image, video)
$0.50 (audio)

Context length

256,000

Built with cutting-edge technology to deliver unmatched performance

1

Breakthrough speed

Achieve up to 9.9x faster time-to-first-token* for real-world workloads with our inference engine featuring Crusoe's MemoryAlloy technology, a cluster-wide KV cache that eliminates duplicate prefills.
2

Superior throughput

Process up to 5x tokens per second* while maintaining low latency for each user with speculative decoding and dynamic batching.
3

Seamless scaling

Meet changing workload demands with scaling that is managed for you, and reliable even when loading the largest models.
*
Benchmarked against vLLM for Llama-3.3-70B model. Read our blog to learn more details.

Crusoe inference engine vs vLLM

0
2
4
6
8
10
TTFT
Throughput
9.9x
5.0x
x Improvement vs. vLLM
Llama-3.3-70B model, 4-node deployment
Optimizing for throughput and price is critical for our product experience. We're excited to explore the performance benefits that Crusoe's Inference Engine provides, and are looking forward to serving our models through the service.
πŸ‘ Headshot of Dhruv Batra
Dhruv Batra
Co-founder & Chief Scientist
πŸ‘ Image
This is the kind of foundational technology that will enable our customers to build and deploy far more powerful and responsive AI agents with confidence.
πŸ‘ Image
We need to process complex records instantly. Crusoe Managed Inference helps us meet that challenge. It provides a reliable path to production at a pace we haven’t seen on other platforms.
πŸ‘ Image

Crusoe Intelligence Foundry,
designed for AI developers

Speed up app development with a unified hub that accelerates model discovery and experimentation, supports quick iteration, and removes the burden of managing infrastructure.

API keys for fastest
path to production

Experiment with top open-source models rapidly. Generate API keys, monitor performance metrics and enable provisioned throughput for production-scale deployments.

Managed endpoints
for rapid deployment

Leverage fully managed endpoints powered by our inference engine, with Crusoe's MemoryAlloy technology, tuned specifically to each model for optimized performance.

Unified interface for
cross-team collaboration

Users working across teams can easily switch between the Crusoe Intelligence Foundry for inference tasks and the Crusoe Cloud Console for infrastructure-as-a-service (IaaS) resources within a single, integrated environment.

Frequently
asked questions

Are you ready to build something amazing?