
URL: https://thenewstack.io/red-hats-ai-platform-now-has-an-ai-inference-server/

Red Hat's AI Platform Now Has an AI Inference Server - The New Stack



Red Hat’s AI Platform Now Has an AI Inference Server

Run any GenAI model on any cloud, hybrid cloud, or multicloud with Red Hat AI Platform.
May 21st, 2025 12:00pm by Steven J. Vaughan-Nichols

BOSTON -- So you want to run a generative AI (GenAI) model. Or, make that models. Or, OK, let’s admit it, you want to run multiple models on the platforms you want when you want them. That’s not easy. To address this need, at Red Hat Summit 2025, Red Hat rolled out the Red Hat AI Inference Server (RHAI).

RHAI is a high-performance, open source platform that works as the execution engine for AI workloads. As the name suggests, RHAI is all about inference: the stage where pre-trained models generate predictions or responses based on new data. Inference is AI’s critical execution engine, where pre-trained models translate data into user interactions.

This platform is built on the widely adopted, open source vLLM project, a high-throughput and memory-efficient inference and serving engine for large language models (LLMs). Where older inference engines are bottlenecked by memory I/O, vLLM divides memory, wherever it may be, into manageable chunks and only accesses what’s needed when necessary. If that sounds a lot like how computers handle virtual memory and paging, you’re right, it does, and it works just as well for LLMs as it does for your PC.
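To make the paging analogy concrete, here is a minimal toy sketch in Python. It is not vLLM's code or API. It assumes the memory being chunked is the per-request key-value cache, and every name in it (`BlockPool`, `Sequence`, `BLOCK_SIZE`) is hypothetical. The point is that a request claims a new fixed-size block only when its last one fills up, and hands all its blocks back the moment it finishes.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per block; hypothetical value chosen for readability


@dataclass
class BlockPool:
    """A fixed pool of equally sized memory blocks, handed out on demand."""
    num_blocks: int
    free: list = field(init=False)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free blocks left")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)


@dataclass
class Sequence:
    """One in-flight request with its own table of physical block ids."""
    pool: BlockPool
    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self):
        # A new physical block is claimed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def finish(self):
        # Return every block to the pool so other requests can reuse them.
        self.pool.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0


pool = BlockPool(num_blocks=8)
a, b = Sequence(pool), Sequence(pool)
for _ in range(5):
    a.append_token()
for _ in range(3):
    b.append_token()
blocks_in_use = pool.num_blocks - len(pool.free)
a.finish()
blocks_after_finish = pool.num_blocks - len(pool.free)
```

In this toy run, the five-token request holds two blocks and the three-token request holds one, so memory is never reserved up front for a maximum length that a request may not reach.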

Neural Magic Technology

To vLLM, Red Hat added technologies from its Neural Magic acquisition. Neural Magic brings software and algorithms that accelerate GenAI inference workloads to the table. The result is an AI inference platform that’s fast enough and cost-efficient enough for you to deploy scalable AI inference engines across any cloud.

RHAI’s key features include:

  • Support for any GenAI Model: The server is model-agnostic, supporting leading open source and third-party validated models such as Llama, Gemma, DeepSeek, Mistral, and Phi, among others.
  • Hardware and Cloud Flexibility: Users can run AI inference on any AI accelerator (GPUs, CPUs, specialized chips) and in any environment — on-premises, public cloud, or hybrid cloud — including seamless integration with Red Hat OpenShift AI and Red Hat Enterprise Linux AI (RHEL AI).
  • Performance and Efficiency: Leveraging vLLM’s high-throughput inference engine, the server supports features like large input contexts, multi-GPU acceleration, and continuous batching, delivering, Red Hat claims, two to four times more token production with optimized models.
  • Model Compression and Optimization: Built-in tools reduce the size of foundational and fine-tuned models, minimizing compute requirements while maintaining or even improving accuracy.
  • Enterprise-Grade Support: Red Hat provides hardened, supported distributions and third-party support, enabling deployment even on non-Red Hat Linux and Kubernetes platforms.
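Continuous batching, mentioned in the performance bullet above, is easier to see with numbers. The following Python sketch is a toy simulation, not vLLM's scheduler: it assumes every request needs one decode step per output token, and the function names and request lengths are made up for illustration. It says nothing about Red Hat's two-to-four-times claim, only about why refilling a freed slot right away wastes fewer steps.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished request's slot is refilled on the very next step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before running the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each active request emits one token; drop the ones that are done.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical output lengths (in tokens) for six queued requests.
lengths = [2, 8, 3, 8, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

With these made-up lengths and two slots, the static schedule takes 18 steps and the continuous one takes 13, because short requests no longer sit idle waiting for a long neighbor to finish.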

Red Hat’s AI Inference Server is available as a standalone containerized solution or as an integrated component of Red Hat OpenShift AI. This is what empowers you to use RHAI to deploy and scale pretty much anywhere. As Brian Stevens, Red Hat’s AI CTO and former Neural Magic CEO, explained in his keynote, you can deploy it “anywhere on anything.” Or, more specifically, on Red Hat OpenShift or any third-party Linux or Kubernetes environment. I don’t know about you, but I like that flexibility.

From a business perspective, Joe Fernandes, Red Hat’s VP and general manager of the AI Business Unit, said, “Inference is where the real promise of GenAI is delivered, where user interactions are met with fast, accurate responses delivered by a given model, but it must be delivered in an effective and cost-efficient way. RHAI Server is intended to meet the demand for high-performing, responsive inference at scale while keeping resource demands low, providing a common inference layer that supports any model, running on any accelerator in any environment.”

Red Hat has big ambitions for RHAI. The company is aiming to do for AI what it did for Linux: make it accessible, reliable, and ubiquitous across enterprise environments.

Distributed GenAI Inference at Scale

Of course, for that to happen, you need a solid open source foundation. For that, Red Hat, in partnership with CoreWeave, Google Cloud, IBM Research, NVIDIA, and numerous other companies and groups, has launched llm-d, an open source project that marries Kubernetes, vLLM-based distributed inference, and intelligent AI-aware network routing to create robust large language model (LLM) inference clouds.

Besides Kubernetes and vLLM, llm-d also incorporates:

  • Prefill and Decode Disaggregation: Separates the input context and token generation phases of AI into discrete operations, which can then be distributed across multiple servers.
  • KV (key-value) Cache Offloading: Based on LMCache, this shifts the memory burden of the KV cache from GPU memory to more cost-efficient and abundant standard storage, like CPU memory or network storage.
  • AI-Aware Network Routing: Schedules incoming requests to the servers and accelerators that are most likely to have hot caches of past inference calculations.
  • High-Performance Communication APIs: Provide faster and more efficient data transfer between servers, with support for NVIDIA Inference Xfer Library (NIXL).
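The routing idea in the list above, sending a request to whichever server probably already has a hot cache for it, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not llm-d's scheduler: the `Router` class, the `prefix_hashes` helper, the block size, and the server names are all hypothetical, and "tokens" here are just whitespace-separated words.

```python
import hashlib

BLOCK = 4  # tokens per cached prefix block; hypothetical value


def prefix_hashes(tokens):
    """Hash each full block together with everything that precedes it."""
    hashes = []
    for end in range(BLOCK, len(tokens) + 1, BLOCK):
        digest = hashlib.sha256(" ".join(tokens[:end]).encode()).hexdigest()
        hashes.append(digest)
    return hashes


class Router:
    """Toy router that remembers which prefix blocks each server has seen."""

    def __init__(self, servers):
        self.cached = {name: set() for name in servers}
        self.load = {name: 0 for name in servers}

    def _hits(self, server, hashes):
        # Count leading blocks already cached; stop at the first miss.
        count = 0
        for h in hashes:
            if h not in self.cached[server]:
                break
            count += 1
        return count

    def route(self, tokens):
        hashes = prefix_hashes(tokens)
        # Prefer the most cached prefix blocks, then the lowest load.
        best = max(self.cached,
                   key=lambda s: (self._hits(s, hashes), -self.load[s]))
        self.cached[best].update(hashes)
        self.load[best] += 1
        return best


router = Router(["gpu-a", "gpu-b"])
system = "you are a helpful support agent for acme".split()
first = router.route(system + ["reset", "my", "password", "please"])
second = router.route("translate this sentence into french now ok thanks".split())
third = router.route(system + ["where", "is", "my", "order"])
```

Here the third request shares its system prompt with the first, so it lands on the same server even though that server is no less busy than the other. A real router would also have to account for cache eviction and for prefill and decode running on different machines, which this sketch ignores.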

Put it all together, Stevens explained, and “The launch of the llm-d community … marks a pivotal moment in addressing the need for scalable GenAI inference, a crucial obstacle that must be overcome to enable broader enterprise AI adoption. By tapping the innovation of vLLM and the proven capabilities of Kubernetes, llm-d paves the way for distributed, scalable and high-performing AI inference across the expanded hybrid cloud, supporting any model, any accelerator, on any cloud environment and helping realize a vision of limitless AI potential.”

Steven J. Vaughan-Nichols, aka sjvn, has been writing about technology and the business of technology since CP/M-80 was the cutting-edge PC operating system, 300bps was a fast internet connection, WordStar was the state-of-the-art word processor, and we liked it.
Red Hat is a sponsor of The New Stack.
TNS owner Insight Partners is an investor in: Unit.