URL: https://dzone.com/articles/open-source-gitops-edge-rancher-fleet

Open-Source GitOps at the Edge: Deploying to Thousands of Clusters With Rancher Fleet

Establish GitOps-driven CI/CD pipelines to create zero-downtime deployments across thousands of edge locations with automated rollbacks.

By Mar. 03, 26 · Analysis

The Edge Deployment Challenge

Modern microservice applications are moving beyond central data centers and the cloud to the edge, where they can deliver the ultra-low latency and real-time processing needed by autonomous vehicles, remote healthcare, and IoT solutions.

Deploying code to distributed edge environments poses a fundamental operational challenge. Every rollout of containerized workloads to thousands of edge locations requires coordination across unreliable networks and heterogeneous hardware, at sites with no technical staff available to correct failed deployments.

Edge sites typically offer limited connectivity, bandwidth that must be shared with other critical business operations, and no on-site engineers to resolve failures during deployment.

Traditional CI/CD pipelines are push-based: a centralized server connects to each target environment and pushes changes to it directly. This model assumes that the deployment target is always reachable and that a failed deployment can be recovered immediately. Edge computing violates both assumptions.

For example, in a retail deployment spanning 2,500 store locations, a push-based pipeline that tries to deploy to all stores simultaneously runs into three problems: connection timeouts at stores with connectivity issues, partially deployed code where the network drops mid-deployment, and no visibility into deployment status for the many stores that are offline (see Figure 1).

Figure 1: Push-based vs. pull-based deployment models for edge environments


GitOps With Open-Source Rancher Fleet

To solve this problem, the deployment model needs to be inverted. Instead of a central server pushing changes to edge locations, pull-based GitOps has each edge cluster pull its desired state from a central Git repository. Rancher Fleet is designed to provide GitOps-based deployment for large numbers of clusters: one Fleet controller can manage well over one million resources across thousands of clusters, which makes it well suited to edge deployments.

Edge clusters managed by Fleet take part in a continuous reconciliation cycle. Each cycle reads the desired state from the Git repository, compares it against the actual state of the cluster, identifies anything that is out of sync ("drift"), and applies the changes needed to close the gap. This model has several advantages at the edge: locations that lose their connection resynchronize automatically when it returns, no persistent connection to an edge cluster is required, and failed deployments are retried (see Figure 2).

Figure 2: Fleet architecture with upstream controller and downstream edge clusters


Fleet’s cluster targeting lets a single configuration file drive deployments to thousands of locations:

YAML

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: edge-retail-app
  namespace: fleet-default
spec:
  repo: https://github.com/org/edge-manifests.git
  branch: main
  paths:
    - apps/retail-app
  pollingInterval: 5m
  imageScanInterval: 30m
  # Credentials for the private Helm registry
  helmSecretName: jfrog-registry-credentials
  targets:
    - name: canary-stores
      clusterSelector:
        matchLabels:
          environment: edge
          rollout-wave: canary
    - name: retail-edge-all
      clusterSelector:
        matchLabels:
          environment: edge
          tier: retail


When an additional edge cluster registers with matching labels, Fleet automatically includes it in subsequent deployments, which helps prevent configuration drift across the fleet.
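Label-based targeting is what lets a newly registered cluster join a deployment without any change to the pipeline. The sketch below shows the usual `matchLabels` semantics (every key in the selector must be present on the cluster with the same value). The cluster names and the `select_clusters` helper are hypothetical and are not part of Fleet's API.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(match_labels: Labels, cluster_labels: Labels) -> bool:
    """True when every selector key/value pair is present on the cluster."""
    return all(cluster_labels.get(k) == v for k, v in match_labels.items())


def select_clusters(clusters: Dict[str, Labels], match_labels: Labels) -> List[str]:
    """Return the sorted names of clusters that satisfy the selector."""
    return sorted(name for name, labels in clusters.items()
                  if matches(match_labels, labels))


# Hypothetical inventory of registered clusters.
clusters = {
    "store-001": {"environment": "edge", "tier": "retail", "rollout-wave": "canary"},
    "store-002": {"environment": "edge", "tier": "retail"},
    "dc-east": {"environment": "cloud", "tier": "retail"},
}
retail_selector = {"environment": "edge", "tier": "retail"}

print(select_clusters(clusters, retail_selector))

# A new store registers with matching labels and is picked up automatically.
clusters["store-003"] = {"environment": "edge", "tier": "retail"}
print(select_clusters(clusters, retail_selector))
```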

The rollout behavior and customizations for each target are defined by the fleet.yaml within the application path:

YAML

# fleet.yaml - Controls deployment behavior
defaultNamespace: retail-apps
helm:
  releaseName: edge-pos-app
  values:
    image:
      repository: artifactory.internal.com/edge/pos-app
      tag: v2.4.1
    resources:
      limits:
        memory: 256Mi
        cpu: 200m
    replicaCount: 1

# Fleet reads rollout settings from the rolloutStrategy key
rolloutStrategy:
  autoPartitionSize: 25
  partitions:
    - name: canary
      maxUnavailable: 1
      clusterSelector:
        matchLabels:
          rollout-wave: canary
    - name: production
      maxUnavailable: 10%
      clusterGroup: retail-fleet

# Per-target overrides layered on top of the base Helm values
targetCustomizations:
  - name: high-traffic-stores
    clusterSelector:
      matchLabels:
        traffic-tier: high
    helm:
      values:
        replicaCount: 3
        resources:
          limits:
            memory: 512Mi


This configuration enables partition-based rollout: the canary clusters are updated first, followed by the production clusters in batches of a predetermined size. The targetCustomizations section gives high-traffic stores additional replicas and memory without requiring separate manifests.
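To see why a target customization only needs to list the values that differ, it helps to look at how an override is layered onto the base values. The following is a minimal sketch that assumes Helm-style merging, where nested maps are merged key by key and scalar values are replaced. The `deep_merge` helper is hypothetical and only illustrates the effect on the values from the listing above.

```python
import copy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with `override` layered on top of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # merge nested maps
        else:
            merged[key] = copy.deepcopy(value)            # replace scalars and lists
    return merged


base_values = {
    "image": {"repository": "artifactory.internal.com/edge/pos-app", "tag": "v2.4.1"},
    "resources": {"limits": {"memory": "256Mi", "cpu": "200m"}},
    "replicaCount": 1,
}
high_traffic_override = {
    "replicaCount": 3,
    "resources": {"limits": {"memory": "512Mi"}},
}

effective = deep_merge(base_values, high_traffic_override)
print(effective["replicaCount"], effective["resources"]["limits"])
```

Under this assumption a high-traffic store ends up with three replicas and a 512Mi memory limit while keeping the 200m CPU limit and the image settings from the base values.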

Wave-Based Deployment Strategy

Rolling out a change to every edge location at once is too risky. For example, if a payment-processing bug ships to all 2,500 stores simultaneously, it immediately disrupts business across the entire fleet. Staged rollouts reduce the blast radius by deploying to a small subset of stores first and validating success before expanding the deployment (see Figure 3).

Figure 3: Four-wave deployment strategy with health validation gates


Wave-Based Rollout Schedule

| Wave | Coverage | Duration | Purpose |
|---|---|---|---|
| Wave 1: Canary | 1% of fleet | 30 minutes | Identify obvious failures with minimal impact |
| Wave 2: Early Adopter | 10% of fleet | 2 hours | Validate against a variety of conditions |
| Wave 3: Regional | 50% of fleet | 4 hours | Confirm scalability and regional variations |
| Wave 4: Full | 100% of fleet | Continuous | Complete rollout |
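As a worked example, the schedule can be turned into concrete store counts for the 2,500-store fleet used earlier. The sketch assumes that the coverage figures are cumulative (Wave 2 brings total coverage to 10% rather than adding another 10%) and that partial stores round up. The `wave_plan` function and the wave identifiers are hypothetical.

```python
from typing import List, Tuple

# (wave name, cumulative coverage in percent), taken from the schedule above.
WAVES = [("canary", 1), ("early-adopter", 10), ("regional", 50), ("full", 100)]


def wave_plan(fleet_size: int) -> List[Tuple[str, int, int]]:
    """Return (wave, cumulative_clusters, newly_added_clusters) per wave."""
    plan = []
    covered = 0
    for name, percent in WAVES:
        # Integer ceiling division avoids floating-point rounding surprises.
        target = (fleet_size * percent + 99) // 100
        plan.append((name, target, target - covered))
        covered = target
    return plan


for name, total, added in wave_plan(2500):
    print(f"{name}: {total} stores covered, {added} newly updated")
```

For 2,500 stores this gives 25 canary stores, 250 stores after the early-adopter wave, 1,250 after the regional wave, and all 2,500 after the full wave, so a fault caught during the canary wave touches 25 stores instead of the whole fleet.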


Each wave transition is gated by an automated health assessment. The deployment controller collects performance metrics from every location in the wave and compares them against previously established thresholds for success rate, error rate, and latency. Only when every metric is within its threshold does the deployment proceed to the next wave.

Health Check Thresholds for Wave Promotion

| Metric | Threshold | Rationale |
|---|---|---|
| Success Rate | ≥ 99% | Determine if application is functioning as intended |
| Error Rate | ≤ 1% | Catch error spikes in the system rapidly |
| P99 Latency | ≤ 500ms | Detect performance degradation |
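The promotion gate implied by these thresholds can be expressed as a small check. This is a sketch under stated assumptions: the `WaveMetrics` type, the function names, and the way metrics are aggregated per wave are hypothetical, and only the three threshold values come from the table above.

```python
from dataclasses import dataclass
from typing import List

MIN_SUCCESS_RATE = 0.99      # success rate must be >= 99%
MAX_ERROR_RATE = 0.01        # error rate must be <= 1%
MAX_P99_LATENCY_MS = 500.0   # P99 latency must be <= 500 ms


@dataclass(frozen=True)
class WaveMetrics:
    """Aggregated metrics for one wave (hypothetical shape)."""
    success_rate: float
    error_rate: float
    p99_latency_ms: float


def failed_checks(metrics: WaveMetrics) -> List[str]:
    """Return the names of the thresholds the wave violates."""
    failures = []
    if metrics.success_rate < MIN_SUCCESS_RATE:
        failures.append("success_rate")
    if metrics.error_rate > MAX_ERROR_RATE:
        failures.append("error_rate")
    if metrics.p99_latency_ms > MAX_P99_LATENCY_MS:
        failures.append("p99_latency_ms")
    return failures


def can_promote(metrics: WaveMetrics) -> bool:
    """A wave is promoted only when no threshold is violated."""
    return not failed_checks(metrics)


canary = WaveMetrics(success_rate=0.997, error_rate=0.003, p99_latency_ms=420.0)
print(can_promote(canary), failed_checks(canary))
```

Returning the list of failed checks rather than a bare boolean keeps the reason for a blocked promotion available for alerts and audit logs.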


Handling Disconnected Edge Locations

Network reliability varies widely across edge deployments. An urban retail location generally has reliable connectivity, while a remote site can be offline for hours. The pipeline must handle both cases without manual intervention.

Fleet tolerates disconnection by design through its agent-based architecture. The Fleet agent on each edge cluster maintains a connection to the upstream controller. If that connection fails, the cluster keeps running its last known desired state, and the agent reconciles any differences once connectivity is restored. Because container images are cached locally, applications continue to run while the agent works to get back in sync.
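The behavior during an outage can be illustrated with a small sketch: when the upstream cannot be reached, the agent keeps the last known desired state, and the next successful fetch brings it back in sync. The `EdgeAgent` class and its `sync` method are hypothetical and do not reflect the Fleet agent's actual code.

```python
from typing import Callable, Dict

State = Dict[str, str]


class EdgeAgent:
    """Keeps serving the last known desired state while upstream is unreachable."""

    def __init__(self, initial_state: State) -> None:
        self.last_known: State = dict(initial_state)
        self.in_sync = True

    def sync(self, fetch_desired: Callable[[], State]) -> State:
        """Try to fetch desired state; fall back to the cached copy on failure."""
        try:
            desired = fetch_desired()
        except ConnectionError:
            self.in_sync = False          # cannot confirm state; retry later
            return self.last_known
        self.last_known = dict(desired)   # reconcile to the fetched state
        self.in_sync = True
        return self.last_known


def uplink_down() -> State:
    raise ConnectionError("upstream unreachable")


agent = EdgeAgent({"pos-app": "v2.4.0"})
print(agent.sync(uplink_down), agent.in_sync)                   # outage
print(agent.sync(lambda: {"pos-app": "v2.4.1"}), agent.in_sync)  # reconnected
```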

Operating containers in a disconnected environment requires hierarchical image caching. JFrog Artifactory is the authoritative repository at the center, JFrog Edge nodes provide caching in each region, and the edge clusters cache images locally. This allows pods that use cached images to restart successfully regardless of network connectivity.
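The three-tier lookup can be sketched as a nearest-first search that fills the closer caches on the way back. The tier names, the `CacheTier` type, and the `pull_image` function are assumptions for illustration and are not JFrog or Kubernetes APIs.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class CacheTier:
    """One level of the image cache hierarchy (hypothetical model)."""
    name: str
    images: Set[str] = field(default_factory=set)
    reachable: bool = True


def pull_image(ref: str, tiers: List[CacheTier]) -> Optional[str]:
    """Search tiers nearest-first; return the serving tier's name or None."""
    for index, tier in enumerate(tiers):
        if not tier.reachable or ref not in tier.images:
            continue
        # Populate every reachable tier that is closer to the cluster.
        for nearer in tiers[:index]:
            if nearer.reachable:
                nearer.images.add(ref)
        return tier.name
    return None


local = CacheTier("cluster-local", {"pos-app:v2.4.0"})
regional = CacheTier("regional-edge-node", {"pos-app:v2.4.0", "pos-app:v2.4.1"})
central = CacheTier("central-artifactory", {"pos-app:v2.4.0", "pos-app:v2.4.1"})
tiers = [local, regional, central]

print(pull_image("pos-app:v2.4.1", tiers))   # first pull while connected

# The uplink drops: only the local cache remains reachable.
regional.reachable = False
central.reachable = False
print(pull_image("pos-app:v2.4.1", tiers))   # pod restart during the outage
```

The second pull succeeds from the local tier only because the first pull populated it, which is why images should be pulled to the edge before a rollout depends on them.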

Automated Rollback

When an incident occurs, minimizing recovery time matters most. Automated rollback removes human decision-making latency from the recovery path (see Figure 4).

Figure 4: Automated rollback flow from detection to recovery


The same metrics that gate wave promotion also trigger rollback. If the success rate falls below its threshold, Fleet halts further rollout activity and initiates a rollback to the most recent known-good version of the application. Operations teams are notified immediately via Slack and PagerDuty, and a full audit history is available for post-incident review.
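The rollback decision can be sketched as a lookup of the most recent healthy release that preceded the current one. The release history format and the `decide` function are hypothetical; the 99% floor is the success-rate threshold from the promotion table, and the notification step is left out.

```python
from typing import List, Optional, Tuple

SUCCESS_RATE_FLOOR = 0.99  # same threshold that gates wave promotion

# (version, was_healthy), oldest first; the last entry is the current release.
History = List[Tuple[str, bool]]


def last_good_version(history: History) -> Optional[str]:
    """Return the newest healthy version that precedes the current release."""
    for version, was_healthy in reversed(history[:-1]):
        if was_healthy:
            return version
    return None


def decide(success_rate: float, history: History) -> Tuple[str, str]:
    """Return (action, version): continue, rollback, or halt with nothing to revert to."""
    current = history[-1][0]
    if success_rate >= SUCCESS_RATE_FLOOR:
        return ("continue", current)
    target = last_good_version(history)
    if target is None:
        return ("halt", current)      # stop the rollout; no known-good release
    return ("rollback", target)


history = [("v2.3.9", True), ("v2.4.0", True), ("v2.4.1", False)]
print(decide(0.95, history))
```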

Key Outcomes

Using this framework at distributed edge locations yields quantifiable benefits:

Deployment Framework Results

| Metric | Result |
|---|---|
| Deployment Success Rate | 99.7% across 2,500+ locations |
| Mean Time to Deploy | 45 minutes (full fleet, staged) |
| Automatic Rollback Time | Under 5 minutes |
| Disconnected Recovery | Automatic sync upon reconnection |


Conclusion

CI/CD pipelines for distributed edge computing cannot simply follow the traditional patterns of the cloud computing world. By implementing GitOps-based synchronization through Rancher Fleet, wave-based rollouts with automated analysis, and disconnected operation capabilities, organizations achieve reliable deployments across thousands of edge locations. The pull-based model turns network unreliability from a blocker into an expected condition with automatic recovery.
