URL: https://thenewstack.io/kubernetes-fleet-management-scale/

How Microsoft is governing thousands of Kubernetes clusters without manual intervention - The New Stack



How Microsoft is governing thousands of Kubernetes clusters without manual intervention

How Microsoft Azure Kubernetes Fleet Manager and Cilium Cluster Mesh tackle the complexity of managing thousands of Kubernetes clusters across distributed environments at scale.
May 7th, 2026 6:07pm by Adrian Bridgwater
Microsoft sponsored this post.

Kubernetes is complicated; everybody knows it. Run it as a fleet of many clusters, at so-called “fleet scale,” and the complexity only multiplies.

Synchronizing the configurations of thousands of clusters across massively distributed environments that span on-premises, cloud, and edge is a big ask, so how is it done?

Small is beautiful, big is brutal

In a more standard (smaller) environment, automated GitOps controllers running in the cluster sync the desired state held in a Git repository with the actual state of a given Kubernetes cluster.
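To make the desired-versus-actual idea concrete, here is a minimal Python sketch of a single reconciliation pass. It is an illustration only, not the code of any real GitOps controller: the function names `diff_state` and `reconcile`, and the use of plain dictionaries to stand in for cluster resources, are assumptions made for this example.

```python
def diff_state(desired: dict, actual: dict) -> dict:
    """Compare desired state (as held in Git) with a cluster's actual state.

    Hypothetical helper: resources are modeled as plain dict entries.
    """
    # Anything missing from the cluster, or different from Git, must be applied.
    to_apply = {
        name: spec
        for name, spec in desired.items()
        if name not in actual or actual[name] != spec
    }
    # Anything in the cluster that Git no longer declares must be removed.
    to_delete = sorted(name for name in actual if name not in desired)
    return {"apply": to_apply, "delete": to_delete}


def reconcile(desired: dict, actual: dict) -> dict:
    """Return the cluster state after one reconciliation pass."""
    changes = diff_state(desired, actual)
    new_state = dict(actual)
    new_state.update(changes["apply"])
    for name in changes["delete"]:
        del new_state[name]
    return new_state
```

A real controller repeats this pass continuously for its cluster. Repeating it for every cluster in a large fleet is where the single-cluster assumption starts to strain.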

Stephane Erbrech, principal software engineer at Microsoft, tells The New Stack that although GitOps is the dominant declarative management pattern in the Kubernetes ecosystem, its single-cluster assumptions are a real constraint as fleet size grows. 

“In a standard GitOps setup, cloud-native software engineering teams might manage one or two clusters,” Erbrech says. “At fleet scale, the complexity shifts from how you deploy… to how you govern a massive, distributed environment without manual intervention.”

He notes that GitOps typically assumes a 1:1 relationship between a repository and a cluster. It often overlooks multi-cluster complexities like global traffic routing, cross-cluster secret synchronization, and unified observability across environments.

“Since the customer base and community have grown around Kubernetes (and it has been cemented as the platform of choice), we’ve seen teams start with one cluster, but then grow to two, then ten… and then subsequently to hundreds or even thousands,” Erbrech explains. “On this journey, they all end up with the same problems that they used to have with VMs and how to manage them and stay compliant and secure.”

Massively distributed Kubernetes has many drivers (straightforward popularity being one), but a key one is that AI is being deployed everywhere, including on edge devices from wind turbines to bakery ovens. Unified cluster management has to scale the same way. As inference workloads become distributed by default, cluster management needs to move beyond the reconciliation lag that GitOps can suffer from at scale.

All the cluster’s a stage

Against this backdrop, Erbrech says that Microsoft Azure Kubernetes Fleet Manager, a management-layer technology, allows teams to define reusable strategies for orchestrating cluster updates across a fleet.

Engineers can group clusters into stages to enable a more controlled rollout. This means cluster updates can be applied sequentially, with validation in lower-risk environments (such as test environments) before they reach critical production clusters in the fleet.
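The staged pattern can be sketched in a few lines of Python. This is a simplified illustration of the general idea, not Azure Kubernetes Fleet Manager's actual API: the `Stage` class, the `run_staged_rollout` function, and the `update` and `validate` callbacks are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Stage:
    """A named group of clusters that are updated together (hypothetical model)."""
    name: str
    clusters: list[str]


def run_staged_rollout(
    stages: list[Stage],
    update: Callable[[str], None],
    validate: Callable[[str], bool],
) -> tuple[list[str], Optional[str]]:
    """Update stages in order, halting if any cluster in a stage fails validation.

    Returns the clusters that were updated and the name of the stage where
    the rollout halted (None if every stage passed).
    """
    updated: list[str] = []
    for stage in stages:
        for cluster in stage.clusters:
            update(cluster)
            updated.append(cluster)
        # Gate: later stages are not touched unless this stage is healthy.
        if not all(validate(cluster) for cluster in stage.clusters):
            return updated, stage.name
    return updated, None
```

Ordering the stages from test to production means a failed validation halts the rollout before it touches the most critical clusters.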

“This control enables developers to deploy applications safely, environment by environment, cluster by cluster, at the pace the team chooses, all while continuously checking metrics and ensuring nothing breaks across the deployed environment,” Erbrech explains.

Cilium Cluster Mesh

Cilium is also an important part of this story. The open-source networking, security, and observability project for cloud-native environments gained multi-cluster connectivity with the arrival of Cilium Cluster Mesh.

“Cilium Cluster Mesh is the technology we use to enable the cross-cluster connectivity [that Microsoft Azure Kubernetes Fleet Manager delivers] and enable the network to be seamless,” Erbrech says.

The control aggregation this technology offers is clearly appealing to cloud-native engineers, many of whom will be painfully aware of just how sprawling and unwieldy the full Kubernetes toolkit can be. The elevated control that Erbrech describes in Microsoft Azure Kubernetes Fleet Manager means teams can now enable clusters to “talk to each other” as workloads are moved from cluster to cluster.

“All that happens seamlessly, and the end user is none the wiser, because the workload remains eminently accessible, and everything works fine,” Erbrech says.

We mentioned AI inference at the edge earlier, and there’s a second factor that shapes where workloads run. Because GPU resources are expensive and occasionally scarce, cross-cluster workload journeys help ensure teams make efficient use of provisioned resources and do not leave them idle or wasted.
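As a rough illustration of that placement reasoning, the Python sketch below picks a cluster for a workload based on idle GPU capacity. It is a toy model under stated assumptions, not how Fleet Manager or any real scheduler decides placement: the `pick_cluster` function and the idea of a simple per-cluster free-GPU count are hypothetical.

```python
from typing import Optional


def pick_cluster(free_gpus: dict[str, int], needed: int) -> Optional[str]:
    """Choose the cluster with the most idle GPUs that can fit the workload.

    `free_gpus` maps a cluster name to its count of unallocated GPUs
    (a hypothetical inventory). Returns None if no cluster has room.
    """
    candidates = {name: free for name, free in free_gpus.items() if free >= needed}
    if not candidates:
        return None
    # Prefer the most idle capacity; break ties by name so the result is deterministic.
    return max(candidates, key=lambda name: (candidates[name], name))
```

Sending work to wherever GPUs sit idle, rather than queueing it on a busy cluster, is the behavior that keeps provisioned capacity from going to waste.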

Cluster lifecycle management

All of which brings us full circle to cluster lifecycle management and the ability to use Microsoft Azure Kubernetes Fleet Manager as a route not just for sequenced Kubernetes version upgrades, but also for end-of-life actions as clusters are periodically retired. 

As platform engineering now intersects with cloud-native management layers across increasingly distributed and complex environments, we will need to manage the fleet to keep it afloat amid a veritable armada of misconfiguration skews. Batten down the hatches, everyone.

Adrian Bridgwater is a technology journalist with three decades of press experience. He has an extensive background in communications, starting in print media, newspapers and also television. Primarily working as an analysis writer dedicated to a software application development ‘beat’,...