BOSTON — So you want to run a generative AI (GenAI) model. Or, make that models. Or, OK, let’s admit it, you want to run multiple models on the platforms you want, when you want them. That’s not easy. To address this need, at Red Hat Summit 2025, Red Hat rolled out the Red Hat AI Inference Server (RHAI).
RHAI is a high-performance, open source platform that works as the execution engine for AI workloads. As the name suggests, RHAI is all about inference: the stage where pre-trained models generate predictions or responses based on new data. In other words, inference is where pre-trained models turn data into user interactions.
This platform is built on the widely adopted, open source vLLM project. vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLMs). The difference between vLLM and older inference engines is that the earlier engines are bottlenecked by memory I/O. vLLM divides memory, wherever it may be, into manageable chunks and only accesses what’s needed when necessary. If that sounds a lot like how computers handle virtual memory and paging, you’re right, it does, and it works just as well for LLMs as it does for your PC.
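To make the paging analogy concrete, here is a minimal Python sketch of block-based memory management. It is not vLLM's code: the class names, the `BLOCK_SIZE` value, and the methods are all hypothetical, chosen only to show how a per-request block table can map a growing sequence onto fixed-size chunks that are allocated on demand.

```python
BLOCK_SIZE = 16  # tokens per block; an illustrative value, not a vLLM setting


class BlockAllocator:
    """Hands out fixed-size memory blocks from a shared pool (hypothetical)."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("no free blocks left in the pool")
        return self.free_blocks.pop()

    def release(self, block_id):
        self.free_blocks.append(block_id)


class Sequence:
    """One request's tokens, tracked through a block table (hypothetical)."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Claim a new block only when every block held so far is full,
        # much as an OS maps a new page only when a process touches it.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self):
        # Return every block to the pool so other requests can reuse it.
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(20):
    seq.append_token()
```

In this sketch a 20-token sequence holds only two blocks rather than a large region reserved up front, and calling `seq.free()` hands both back to the pool as soon as the request finishes.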
To vLLM, Red Hat added technologies from its Neural Magic acquisition. Neural Magic brings to the table software and algorithms that accelerate GenAI inference workloads. The result is an AI inference platform that’s fast enough and cost-efficient enough for you to deploy scalable AI inference engines across any cloud.
RHAI’s key features include:
Red Hat’s AI Inference Server is available as a standalone containerized solution or as an integrated component of Red Hat OpenShift AI. That packaging is what lets you deploy and scale RHAI pretty much anywhere. As Brian Stevens, Red Hat’s AI CTO and former Neural Magic CEO, explained in his keynote, you can deploy it “anywhere on anything.” Or, more specifically, on Red Hat OpenShift or any third-party Linux or Kubernetes environment. I don’t know about you, but I like that flexibility.
From a business perspective, Joe Fernandes, Red Hat’s VP and general manager of the AI Business Unit, said, “Inference is where the real promise of GenAI is delivered, where user interactions are met with fast, accurate responses delivered by a given model, but it must be delivered in an effective and cost-efficient way. RHAI Server is intended to meet the demand for high-performing, responsive inference at scale while keeping resource demands low, providing a common inference layer that supports any model, running on any accelerator in any environment.”
Red Hat has big ambitions for RHAI. Red Hat is aiming to do for AI what it did for Linux — make it accessible, reliable, and ubiquitous across enterprise environments.
Of course, for that to happen, you need a solid open source foundation. For that, Red Hat, in partnership with CoreWeave, Google Cloud, IBM Research, NVIDIA, and numerous other companies and groups, has launched llm-d. The llm-d project is open source and marries Kubernetes, vLLM-based distributed inference, and intelligent AI-aware network routing to create robust large language model (LLM) inference clouds.
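The article doesn't say how llm-d's routing works, so the following Python sketch is only a toy illustration of what “AI-aware” routing could mean. It assumes, purely for the example, that a router prefers a replica that has already processed the same prompt prefix and otherwise picks the least busy one. `Replica`, `pick_replica`, and `prefix_len` are hypothetical names, not llm-d APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A model-serving instance as seen by the router (hypothetical)."""
    name: str
    queued_requests: int = 0
    cached_prefixes: set = field(default_factory=set)


def pick_replica(replicas, prompt, prefix_len=32):
    """Prefer a replica that has seen this prompt prefix, then the shortest queue."""
    prefix = prompt[:prefix_len]

    def score(replica):
        cache_miss = prefix not in replica.cached_prefixes
        # Tuples compare element by element: cache hits (False) sort first,
        # and ties are broken by the number of queued requests.
        return (cache_miss, replica.queued_requests)

    best = min(replicas, key=score)
    best.queued_requests += 1
    best.cached_prefixes.add(prefix)
    return best


replicas = [
    Replica("a", queued_requests=3, cached_prefixes={"You are a helpful assistant."}),
    Replica("b", queued_requests=0),
]
warm = pick_replica(replicas, "You are a helpful assistant.")
cold = pick_replica(replicas, "Summarize this report.")
```

Here the first request goes to replica `a` despite its longer queue, because it has seen the prefix before, while the unrelated second request goes to the idle replica `b`. A real router would have to weigh cache reuse against load rather than always favoring a cache hit.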
Besides Kubernetes and vLLM, llm-d also incorporates:
Put it all together, Stevens explained, and “The launch of the llm-d community … marks a pivotal moment in addressing the need for scalable GenAI inference, a crucial obstacle that must be overcome to enable broader enterprise AI adoption. By tapping the innovation of vLLM and the proven capabilities of Kubernetes, llm-d paves the way for distributed, scalable and high-performing AI inference across the expanded hybrid cloud, supporting any model, any accelerator, on any cloud environment and helping realize a vision of limitless AI potential.”