When Clayton Coleman’s remark that inference is the new web app landed at KubeCon NA, it resonated. Just five years ago, if you asked any Site Reliability Engineer (SRE) about their job, you’d hear about keeping web apps fast, scalable, and resilient. Today? The landscape is shifting beneath our feet. AI inference, the process by which a trained model applies what it has learned to make predictions on new data, is becoming as mission-critical as web applications ever were.
“Inference refers to the process by which a trained model applies its learned patterns to new, unseen data to generate predictions or decisions. During inference, the model utilizes its knowledge to respond to real-world inputs.”
This evolution demands a new discipline: AI Reliability Engineering (AIRe). We’re no longer just battling latency spikes in HTTP requests; we’re grappling with token generation delays in LLMs. Optimizing database queries feels almost quaint compared to optimizing model checkpoints and tensors. AI models, like the web apps before them, demand intense scalability, reliability, and observability — but on a level we’re still architecting.
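To make "token generation delays" concrete, here is a minimal Python sketch of two latency signals commonly tracked for streaming LLM responses: time to first token (TTFT) and mean inter-token latency. The names `measure_stream` and `fake_llm_stream` are hypothetical, and the fake stream stands in for a real streaming client; treat this as an illustration under those assumptions rather than a reference implementation.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report TTFT and mean inter-token latency."""
    start = clock()
    arrivals = []
    for _ in tokens:
        # Record the arrival time of every token as it is yielded.
        arrivals.append(clock())
    if not arrivals:
        return {"tokens": 0, "ttft_s": None, "mean_itl_s": None}
    ttft = arrivals[0] - start
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {"tokens": len(arrivals), "ttft_s": ttft, "mean_itl_s": mean_itl}


def fake_llm_stream(n_tokens: int, first_delay_s: float, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM client (assumption)."""
    time.sleep(first_delay_s)  # simulated prefill / queueing delay
    for i in range(n_tokens):
        if i:
            time.sleep(gap_s)  # simulated decode step between tokens
        yield f"tok{i}"
```

Calling `measure_stream(fake_llm_stream(20, 0.3, 0.02))` returns a dictionary with the token count and both latencies. In a real system these values would typically feed histograms, so that SLOs can be set on percentiles rather than averages.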
I’ve spent almost two years deep in AI Reliability Engineering — researching, prototyping, and building real-world inference systems. From the DevOps conferences to SRE Days and community meetups in Nuremberg and London, I’ve shared hard-earned lessons with peers in the field. Now, I’m bringing those insights here.
Unreliable AI is worse than no AI at all.
Inference is no longer just a sub-process of machine learning. It’s the application. It’s production. And it’s redefining the operational stack beneath it.
Traditional SRE principles offer a foundation, but they don’t quite fit AI workloads.
Model decay is silent degradation: unlike traditional software issues that trigger immediate crashes or errors, AI models can degrade silently, continuing to function while producing increasingly inaccurate, biased, or inconsistent outputs.
Why We Treat Silent Model Degradation Like a Production Incident
Because it is one. Unlike crashing pods or failing endpoints, silent model degradation slips under the radar: the model keeps responding, but its answers grow weaker, more biased, or just wrong. Users don’t see 500 errors; they get hallucinations, toxic outputs, or faulty decisions. That’s not just a bug; it’s a breach of trust. In the world of AI, correctness is uptime. When reliability means quality, degradation is downtime.
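One way to operationalize "correctness is uptime" is to treat each graded response as a good or bad event and track a rolling quality SLI against a target, just as you would for availability. The sketch below assumes some external grader (human review, an evaluation set, or an automated check) supplies a pass/fail verdict per response; the class name `QualitySLO` and its default thresholds are hypothetical.

```python
from collections import deque
from typing import Optional


class QualitySLO:
    """Rolling quality SLI over the most recent graded responses."""

    def __init__(self, target: float = 0.95, window: int = 100, min_samples: int = 20):
        self.target = target            # required fraction of passing responses
        self.min_samples = min_samples  # avoid alerting on too little data
        self.results = deque(maxlen=window)

    def record(self, passed: bool) -> None:
        """Store the grader's verdict for one response."""
        self.results.append(bool(passed))

    def sli(self) -> Optional[float]:
        """Fraction of passing responses, or None until enough samples exist."""
        if len(self.results) < self.min_samples:
            return None
        return sum(self.results) / len(self.results)

    def breached(self) -> bool:
        """True when the rolling pass rate has fallen below the target."""
        value = self.sli()
        return value is not None and value < self.target
```

Every request in this scenario can return HTTP 200 while `breached()` still flips to `True`, which is exactly the kind of incident that endpoint health checks alone would miss.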
Perhaps we won’t just extend Kubernetes for AI — we might eventually need to fork it.
Large Language Models (LLMs) require specialized traffic routing, rate limiting, and security enforcement capabilities that standard Kubernetes Ingress mechanisms weren’t built to handle. Kubernetes, architected around stateless web apps, wasn’t designed with inference in mind. While it’s adapting, key gaps remain.
Inference workloads demand tightly integrated solutions for hardware acceleration, resource orchestration, and high-throughput traffic control. The Kubernetes ecosystem is catching up with initiatives like WG-Serving (targeting optimized AI/ML serving), Device Management (focused on integrating GPUs/TPUs via DRA), and the evolving Gateway API Inference Extension, which lays the groundwork for scalable and secure LLM endpoint routing. Meanwhile, emerging AI Gateways step in to fill the void — providing routing logic, observability, and access control tailored to inference.
Still, we’re layering AI on top of an orchestration system that wasn’t originally meant for it. Google’s announcement of supporting 65K-node Kubernetes clusters by swapping etcd with Spanner-backed storage hints at a future where foundational changes might be required. Perhaps we won’t just extend Kubernetes for AI — we might eventually need to fork it.
So, how do we apply SRE practices to this new AI reality?
In the early days of SRE, we relied on load balancers, service meshes, and API gateways to manage traffic, enforce policies, and maintain observability. Today, inference workloads demand the same — but with more complexity, more scale, and far less tolerance for latency or failure. That’s where AI Gateways come in.
Think of them as the modern SRE’s all-in-one box for AI: routing requests to the right model, balancing load across replicas, enforcing rate limits and security policies, and exposing deep observability hooks, all at once. Projects like Gloo AI Gateway are pushing this forward. They’re tackling enterprise-grade challenges, such as model cost control, token-based security, and real-time tracing of LLM responses, that traditional service meshes weren’t built for.
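To illustrate two of the gateway duties named above, the following Python sketch shows a rate limiter that charges clients in LLM tokens rather than requests, plus a least-in-flight replica picker. It is not how Gloo AI Gateway or any specific product works; `TokenBudgetLimiter`, `pick_replica`, and the per-minute budget are hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBudgetLimiter:
    """Token bucket that charges LLM tokens, not requests, per client."""

    def __init__(self, tokens_per_minute: int,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.refill_per_s = tokens_per_minute / 60.0
        self.clock = clock
        # client id -> (remaining token budget, time of last update)
        self.buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, client: str, requested_tokens: int) -> bool:
        """Admit the request only if the client's budget covers its tokens."""
        now = self.clock()
        remaining, last = self.buckets.get(client, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        remaining = min(self.capacity, remaining + (now - last) * self.refill_per_s)
        if requested_tokens > remaining:
            self.buckets[client] = (remaining, now)
            return False
        self.buckets[client] = (remaining - requested_tokens, now)
        return True


def pick_replica(in_flight: Dict[str, int]) -> str:
    """Route to the model replica with the fewest in-flight requests."""
    return min(in_flight, key=in_flight.get)
```

Charging by tokens instead of requests reflects the fact that one long prompt can cost far more accelerator time than many short ones, which is why request-count limits from classic API gateways tend to map poorly onto inference traffic.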
This is where SRE belongs today: not just tuning autoscalers, but operating the control plane for intelligent systems.
The AI Gateway is the new tool on our belt — and maybe the most important one.
Our role as SREs is evolving. We need the curiosity described in “97 Things Every SRE Should Know” more than ever — the drive to understand the entire system, from silicon to the nuances of model behavior. We must build AI we can trust, leveraging the emerging ecosystem of tools and standards.
Björn Rabenstein spoke of a “third age” of SRE, where its principles become universally embedded. While this is true, the new era is being shaped by AI. AI Reliability Engineering isn’t just an extension of SRE; it’s a fundamental reshaping, shifting focus from infrastructure reliability to the reliability of intelligent systems themselves.
Because if Inference truly is the new web app, then ensuring its Reliability is the new Age of SRE. And an unreliable AI? That’s worse than no AI at all.