Kubernetes has always been sensitive to bad inputs. What’s changed is how often teams are running into it. As more bursty workloads—especially inference—land on Kubernetes, a familiar pattern shows up more often: clusters add nodes even when utilization looks fine. The CNCF’s latest annual survey frames Kubernetes as the default platform for running AI in production, which means more teams are now feeling the edges of scheduling and autoscaling behavior that used to stay in the background.
If you’ve ever looked at a cluster and wondered why it’s adding nodes when the dashboards suggest plenty of headroom, you’re not alone. This tends to show up after Cluster Autoscaler or Karpenter is already in place and obvious capacity issues such as mis-sized nodes, overly tight constraints, or consolidation settings have been cleaned up.
And yet capacity keeps creeping up. Why is that?
Most of the time it isn’t because Cluster Autoscaler or Karpenter is misconfigured. It’s the inputs. Requests are often set with a bit of guesswork and a safety buffer on top, and they don’t always get revisited as the workload changes.
The graphs you’re staring at can look calm, but when you look across a lot of clusters, the signs show up in a familiar order:
From the outside, it looks like the cluster is scaling when it shouldn’t. But under the hood, it’s doing what it was built to do: place pods conservatively based on declared reservations.
The scheduler places and packs pods based on requests. Usage graphs don’t factor into that placement decision.
So when requests drift high, nodes get treated as full long before usage gets there. Once the scheduler can’t place pods, the autoscaler adds enough capacity so the pods can land, and that’s when the cluster starts growing even though usage dashboards still look fine.
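To make the mechanics concrete, here is a minimal Python sketch of request-based packing. It is a toy first-fit model over CPU only, not the real kube-scheduler (which filters and scores nodes across many more dimensions). The pod names, the 4000m node size, and the request and usage figures are all made-up assumptions.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    cpu_request_m: int  # declared CPU request, in millicores
    cpu_usage_m: int    # observed CPU usage, in millicores (never consulted for placement)


def pack_by_requests(pods, node_allocatable_m):
    """First-fit packing that looks only at requests. Returns one list of pods per node."""
    nodes = []  # each entry is [free_millicores, placed_pods]
    for pod in pods:
        if pod.cpu_request_m > node_allocatable_m:
            raise ValueError(f"{pod.name} can never fit on this node size")
        for node in nodes:
            if node[0] >= pod.cpu_request_m:
                node[0] -= pod.cpu_request_m
                node[1].append(pod)
                break
        else:
            # No existing node has room on paper, so a new node is needed.
            nodes.append([node_allocatable_m - pod.cpu_request_m, [pod]])
    return [placed for _, placed in nodes]


# Hypothetical workload: 12 replicas requesting 1000m each, using about 200m each.
pods = [Pod(f"api-{i}", cpu_request_m=1000, cpu_usage_m=200) for i in range(12)]
placement = pack_by_requests(pods, node_allocatable_m=4000)

node_count = len(placement)
usage_pct = 100 * sum(p.cpu_usage_m for p in pods) / (node_count * 4000)
print(f"nodes: {node_count}, aggregate CPU usage: {usage_pct:.0f}%")
```

In this toy run, twelve pods fill three nodes on paper while aggregate usage sits at 20%. A thirteenth pod would not fit anywhere, which is the point where an autoscaler would add a node.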
Request drift doesn’t require a messy environment. Actually, the causes are usually boring.
A service has an incident and someone bumps requests to stabilize it. The incident fades, the request values stay, and nobody wants to be the person who lowers them and finds out the hard way they were masking a real issue.
Defaults get standardized because teams need a way to ship, and those defaults get copied into new services long after anyone remembers why they were chosen.
Workloads change. Traffic shape shifts. Dependencies change. A request that was reasonable a quarter ago can be off after a launch, a new customer profile, or a batch job that quietly becomes part of the daily baseline.
Inference workloads make this show up faster because replica counts move quickly, and a padded request doesn’t stay small for long. Either way, you end up with drift: old values surviving in YAML because nothing forced them to get revisited.
The question I’d start with is: do our requests still resemble how these workloads run today?
If requests are inflated, the scheduler rejects placements that would have fit if it were going by usage. Packing looks worse than it should, so the autoscaler adds nodes to satisfy reservations rather than actual pressure. Consolidation can’t save you because the cluster is full on paper.
You’re not trying to nail perfect numbers. You’re trying to get requests back into the neighborhood so the scheduler and autoscaler stop amplifying stale reservations.
You can sanity-check this pretty quickly. Start with the handful of namespaces that dominate capacity and do two checks over a week or two.
First, compare requests to observed usage over time. Not a single snapshot. Look for a gap that repeats day after day.
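One way to sketch that first check in Python is below. It assumes you have already exported per-day CPU usage samples (in millicores) from whatever metrics system you use. The function names, the sample data, and the 2x threshold are illustrative assumptions, not recommendations.

```python
def percentile(samples, pct):
    """Nearest-rank percentile. pct is an integer from 1 to 100; samples must be non-empty."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


def request_to_usage_ratios(request_m, usage_by_day):
    """Map each day to request / p95(usage) for that day's samples."""
    ratios = {}
    for day, samples in usage_by_day.items():
        p95 = percentile(samples, 95)
        ratios[day] = request_m / p95 if p95 > 0 else float("inf")
    return ratios


def gap_is_persistent(ratios, min_ratio=2.0):
    """True only if the request exceeds p95 usage by min_ratio on every observed day."""
    return bool(ratios) and all(r >= min_ratio for r in ratios.values())


# Hypothetical samples for one container that requests 1000m.
usage_by_day = {
    "mon": [180, 220, 250, 300, 210],
    "tue": [200, 240, 260, 280, 230],
    "wed": [190, 210, 400, 250, 220],
}
ratios = request_to_usage_ratios(1000, usage_by_day)
persistent = gap_is_persistent(ratios)
print(ratios, persistent)
```

Requiring the gap on every day, rather than on average, is a deliberate choice here: one quiet day should not flag a workload, and one busy day should be enough to clear it.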
Second, look at what happens when the workload scales out. Padding doesn’t stay small when replicas climb. If a deployment goes from 5 to 50 replicas under load, a little extra request turns into a lot of reserved capacity fast.
Once requests are inflated and replicas climb, extra nodes are the expected outcome.
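The arithmetic is worth writing down once. The sketch below uses the 5-to-50 replica example from above; the 500m request, the 200m typical usage, and the 4000m node size are assumed numbers chosen only to show how the padding scales.

```python
def reserved_padding_m(replicas, request_m, typical_usage_m):
    """Millicores reserved beyond typical usage, summed across all replicas."""
    return replicas * max(0, request_m - typical_usage_m)


NODE_ALLOCATABLE_M = 4000  # assumed node size for illustration

for replicas in (5, 50):
    padding = reserved_padding_m(replicas, request_m=500, typical_usage_m=200)
    print(f"{replicas} replicas: {padding}m of padding, "
          f"about {padding / NODE_ALLOCATABLE_M:.2f} nodes of reserved headroom")
```

With these assumed figures, 300m of padding per pod is a rounding error at 5 replicas and several nodes of reserved capacity at 50.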
Allocation views can help here because they make baseline drivers easier to see, even if you’re not doing chargeback.
When requests are inflated, the same mismatch tends to surface: allocated looks high while usage looks low, and shared overhead becomes the part nobody can explain cleanly. Usually the tool is telling the truth — it’s just showing an uncomfortable mix of reserved capacity, safety headroom, and shared services that everyone depends on.
The useful questions are practical:
Once those buckets are explicit, the conversation gets more straightforward. You stop debating the output and start looking at the biggest gaps.
This is where teams sometimes over-scope the cleanup. You can usually get traction by starting with a small set of workloads.
Start with the few services that dominate baseline capacity. If you bring those requests back in line, packing improves and the autoscaler stops chasing reservations that aren’t real. That’s often the point where the cluster stops feeling like it’s scaling for no reason.
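Finding those few services can be as simple as ranking workloads by how much capacity they reserve beyond what they use. Here is a rough sketch, assuming you can export replica counts, per-pod requests, and p95 usage per workload. The field names and the sample workloads are hypothetical.

```python
def top_reservation_gaps(workloads, limit=3):
    """Rank workloads by replicas * (request - p95 usage), largest gap first."""
    gaps = [
        (w["name"], w["replicas"] * max(0, w["request_m"] - w["p95_usage_m"]))
        for w in workloads
    ]
    return sorted(gaps, key=lambda item: item[1], reverse=True)[:limit]


# Hypothetical export; all values are in millicores per pod.
workloads = [
    {"name": "checkout-api", "replicas": 40, "request_m": 1000, "p95_usage_m": 250},
    {"name": "inference-gateway", "replicas": 50, "request_m": 2000, "p95_usage_m": 900},
    {"name": "cron-reports", "replicas": 2, "request_m": 500, "p95_usage_m": 100},
    {"name": "search-indexer", "replicas": 6, "request_m": 1500, "p95_usage_m": 1200},
]

top = top_reservation_gaps(workloads, limit=2)
for name, gap_m in top:
    print(f"{name}: {gap_m}m reserved beyond p95 usage")
```

Ranking by total gap rather than per-pod gap keeps the focus on baseline capacity: a small service with a wildly padded request matters less than a large one that is modestly off.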
The other make-or-break factor is rollout safety. If request changes reliably create pager noise, nobody keeps doing them. The padded values stick around because they feel safer.
So the sustainable approach is intentionally boring: make changes gradually, make rollback easy, and put guardrails around how aggressive you’re willing to be. Once teams trust that request changes won’t destabilize things, drift becomes manageable instead of permanent.
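As a sketch of what those guardrails could look like, the function below lowers a request toward observed usage in capped steps. The 30% headroom, the 20% maximum step, and the 50m floor are placeholder assumptions. The function only computes the next value; applying it and rolling it back are left to whatever deployment process you already trust.

```python
def ceil_div(a, b):
    """Integer ceiling division, to avoid float rounding surprises."""
    return -(-a // b)


def next_request_m(current_m, p95_usage_m, headroom_pct=30, max_step_down_pct=20, floor_m=50):
    """Return the next, lower CPU request in millicores, or current_m if no cut is warranted."""
    # Target: observed p95 plus headroom, never below an absolute floor.
    target = max(floor_m, ceil_div(p95_usage_m * (100 + headroom_pct), 100))
    if target >= current_m:
        return current_m  # this sketch only lowers requests; raising is left to a human
    # Guardrail: never cut by more than max_step_down_pct in a single change.
    lowest_allowed = ceil_div(current_m * (100 - max_step_down_pct), 100)
    return max(target, lowest_allowed)


# Walk a hypothetical 1000m request down toward a 300m p95, one change at a time.
steps = [1000]
while True:
    proposed = next_request_m(steps[-1], p95_usage_m=300)
    if proposed == steps[-1]:
        break
    steps.append(proposed)
print(steps)
```

Capping each step means the first change is small enough to observe and revert, and the value settles at usage plus headroom instead of chasing the raw usage number.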
When requests are closer to reality, a few things stop being weird.
The scheduler can pack workloads more cleanly. Pending pods line up with real demand instead of stale reservations. Autoscaling adds capacity when there’s actual pressure, not because requests are reserving empty space.
Allocation also gets easier to explain, mostly because shared overhead is visible instead of hidden inside everyone’s numbers.
If capacity has been creeping up and cluster-level tuning isn’t sticking, start with the workloads that reserve the most. Compare requests to usage over time. If they’re far apart, you’ve probably found the driver behind the strange scaling behavior.