URL: https://dev.to/samson_tanimawo/canary-deployments-the-pattern-that-cut-our-rollback-rate-by-80-bfa

Canary Deployments: The Pattern That Cut Our Rollback Rate by 80%


Deploy and Pray

Our deployment strategy used to be: merge to main, deploy to all pods, watch Slack for complaints. Professional? No. Common? Absolutely.

After a particularly bad deploy took down checkout for 23 minutes, we implemented canary deployments.

What Canary Deployments Actually Mean

A canary deployment routes a small percentage of traffic to the new version while monitoring for problems:

Traffic flow:
  Users ──→ Load Balancer ──┬──→ 95% → v1.2.3 (current)
                            └──→  5% → v1.2.4 (canary)

If the canary looks healthy after N minutes, gradually increase traffic. If it looks bad, kill it. Zero impact on 95% of users.
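
To make the 95/5 split concrete, here is a minimal Python sketch of weighted routing. It is an illustration only, not how our load balancer decides: `pick_backend`, `canary_share`, and the idea of hashing a request key into 100 buckets are all assumptions made for this example. Hashing has one nice property: a given user stays on the same version as the weight grows.

```python
import hashlib


def pick_backend(request_key: str, canary_weight: int) -> str:
    """Return "canary" for roughly canary_weight percent of keys, else "stable".

    Hypothetical helper: the key (for example a user ID) is hashed into one of
    100 buckets, and buckets below the weight go to the canary.
    """
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_weight else "stable"


def canary_share(keys, canary_weight: int) -> float:
    """Fraction of the given keys that would be routed to the canary."""
    keys = list(keys)
    hits = sum(pick_backend(key, canary_weight) == "canary" for key in keys)
    return hits / len(keys)
```

Because the bucket for a key never changes, anyone who was on the canary at 5% is still on it at 50%, which keeps a user's experience consistent while the rollout widens.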

Our Canary Pipeline

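# .github/workflows/canary-deploy.yml (job excerpt; nest it under the workflow's jobs: key)
canary_deploy:
  steps:
    - name: Deploy canary (5%)
      run: |
        kubectl set image deployment/api-canary api=api:${{ github.sha }}
        kubectl scale deployment/api-canary --replicas=1
        # Configure traffic split
        kubectl apply -f - <<EOF
        apiVersion: split.smi-spec.io/v1alpha1
        kind: TrafficSplit
        metadata:
          name: api-canary
        spec:
          service: api
          backends:
            - service: api-stable
              weight: 95
            - service: api-canary
              weight: 5
        EOF

    - name: Wait and analyze (10 minutes)
      run: |
        sleep 600
        # Check canary health. --data-urlencode keeps the shell from mangling the
        # PromQL, and jq -r strips the quotes from the returned value.
        PROM=http://prometheus/api/v1/query
        # Error ratio (5xx / all requests), the same ratio used in the checklist
        ERROR_RATE=$(curl -sG "$PROM" --data-urlencode 'query=sum(rate(http_5xx_total{version="canary"}[5m])) / sum(rate(http_requests_total{version="canary"}[5m]))' | jq -r '.data.result[0].value[1]')
        LATENCY=$(curl -sG "$PROM" --data-urlencode 'query=histogram_quantile(0.99, sum by (le) (rate(http_duration_bucket{version="canary"}[5m])))' | jq -r '.data.result[0].value[1]')

        echo "Canary error rate: $ERROR_RATE"
        echo "Canary p99 latency: $LATENCY"

        # Fail closed: no data is not the same as healthy
        if [ "$ERROR_RATE" = "null" ] || [ "$ERROR_RATE" = "NaN" ]; then
          echo "CANARY FAILED: No error-rate data for the canary"
          exit 1
        fi
        if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
          echo "CANARY FAILED: Error rate too high"
          exit 1
        fi
        if (( $(echo "$LATENCY > 0.5" | bc -l) )); then
          echo "CANARY FAILED: p99 latency above 500ms"
          exit 1
        fi

    - name: Promote to 50%
      run: |
        kubectl apply -f traffic-split-50.yaml
        sleep 600 # Wait another 10 min

    - name: Full rollout
      run: |
        kubectl set image deployment/api-stable api=api:${{ github.sha }}
        # Wait for stable to finish rolling out before removing the canary
        kubectl rollout status deployment/api-stable --timeout=10m
        # Remove the split first so no traffic is routed to a deleted canary
        kubectl delete trafficsplit api-canary
        kubectl delete deployment api-canary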
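
The staged flow in the pipeline above (5%, then 50%, then full rollout) boils down to a small loop: shift traffic, wait, check, and back out on the first bad signal. Here is a rough Python sketch of that control flow. The function name `run_canary` and its callback parameters are hypothetical, and the callbacks stand in for whatever actually changes weights and queries metrics in your setup.

```python
from typing import Callable, Sequence


def run_canary(
    stages: Sequence[int],
    set_canary_weight: Callable[[int], None],
    is_healthy: Callable[[], bool],
    wait: Callable[[], None],
) -> bool:
    """Walk the canary through each traffic stage.

    Returns True if every stage stayed healthy (safe to do the full rollout),
    or False after sending the canary back to 0% on the first failed check.
    """
    for weight in stages:
        set_canary_weight(weight)  # e.g. apply a traffic split for this stage
        wait()                     # e.g. sleep through the observation window
        if not is_healthy():
            set_canary_weight(0)   # kill the canary: all traffic back to stable
            return False
    return True
```

Passing the side effects in as callbacks keeps the loop easy to test with fakes before pointing it at a real cluster.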

The Canary Checklist

What we check during the canary window:

CANARY_CHECKS = {
    'error_rate': {
        'query': 'rate(http_5xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.01,  # Max 1% errors
        'comparison': 'less_than'
    },
    'latency_p99': {
        'query': 'histogram_quantile(0.99, rate(http_duration_bucket{version="canary"}[5m]))',
        'threshold': 0.5,  # Max 500ms
        'comparison': 'less_than'
    },
    'success_rate': {
        'query': 'rate(http_2xx_total{version="canary"}[5m]) / rate(http_requests_total{version="canary"}[5m])',
        'threshold': 0.99,  # Min 99% success
        'comparison': 'greater_than'
    },
    'memory_usage': {
        'query': 'container_memory_working_set_bytes{version="canary"}',
        'threshold': 512 * 1024 * 1024,  # Max 512MB
        'comparison': 'less_than'
    }
}
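
A dictionary like this still needs something to evaluate it. Below is a minimal sketch of one way to do that, assuming the metric values have already been fetched into a plain dict. `evaluate_checks`, `COMPARATORS`, and `EXAMPLE_CHECKS` are hypothetical names for this example, and the example uses a trimmed-down copy of two checks without their queries.

```python
import operator

# Maps the 'comparison' strings used in the checklist to Python operators.
COMPARATORS = {"less_than": operator.lt, "greater_than": operator.gt}


def evaluate_checks(checks: dict, observed: dict) -> list:
    """Return the names of the checks that failed; an empty list means healthy.

    A metric that is missing from `observed` counts as a failure, so a canary
    with no data is never promoted by accident.
    """
    failures = []
    for name, check in checks.items():
        value = observed.get(name)
        passes = COMPARATORS[check["comparison"]]
        if value is None or not passes(value, check["threshold"]):
            failures.append(name)
    return failures


# Trimmed-down copy of two checklist entries, for illustration only.
EXAMPLE_CHECKS = {
    "error_rate": {"threshold": 0.01, "comparison": "less_than"},
    "success_rate": {"threshold": 0.99, "comparison": "greater_than"},
}
```

Returning the list of failed checks, rather than a single boolean, makes the failure message in the pipeline log say which signal tripped.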

Results After 6 Months

Metric                                  Before           After
Rollback rate                           15% of deploys   3% of deploys
Mean time to detect bad deploy          25 min           8 min
Customer-facing incidents from deploys  4/month          0.5/month
Deploy frequency                        1x/day (afraid)  5x/day (confident)

The counterintuitive result: we deploy MORE often now because we're less afraid. And because each deploy is smaller, issues are easier to find.
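
For anyone checking the headline number against the table, the arithmetic is a simple relative reduction, (before - after) / before. The short Python sketch below works it through for the three metrics that went down; `relative_reduction` and `RESULTS` are names made up for this example. It gives 80% for rollback rate, 68% for time to detect, and 87.5% for deploy-caused incidents.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a metric dropped, relative to its starting value."""
    return (before - after) / before


# (before, after) pairs taken from the results table.
RESULTS = {
    "rollback rate (% of deploys)": (15, 3),
    "mean time to detect (min)": (25, 8),
    "deploy incidents per month": (4, 0.5),
}

for metric, (before, after) in RESULTS.items():
    print(f"{metric}: {relative_reduction(before, after):.1%} lower")
```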

Start Simple

You don't need Istio or a service mesh for canary deploys. Start with:

  1. Two deployment objects (stable + canary)
  2. A load balancer that supports weighted routing
  3. A script that checks error rates after deploy
  4. A human who decides whether to promote or roll back

Automate from there.
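
Items 3 and 4 in the list above can be as small as the Python sketch below: fetch one number from Prometheus, print a recommendation, and let a person make the call. It uses only the standard library and the Prometheus instant-query endpoint (`/api/v1/query`). The function names, the 1% default threshold, and the Prometheus URL you pass in are assumptions for illustration, so adjust them to your own setup.

```python
import json
import math
import urllib.parse
import urllib.request


def parse_instant_value(payload: dict):
    """Pull the first sample out of a Prometheus instant-query response.

    Returns None when the query failed, returned no series, or produced NaN
    (which is what a ratio yields when the canary has had no traffic).
    """
    if payload.get("status") != "success":
        return None
    result = payload.get("data", {}).get("result", [])
    if not result:
        return None
    value = float(result[0]["value"][1])  # samples arrive as [timestamp, "value"]
    return None if math.isnan(value) else value


def fetch_error_rate(prometheus_url: str, query: str, timeout: float = 10.0):
    """Run an instant query against the given Prometheus base URL."""
    url = f"{prometheus_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_instant_value(json.load(response))


def recommend(error_rate, threshold: float = 0.01) -> str:
    """Turn an error rate into advice for the human making the decision."""
    if error_rate is None:
        return "hold: no canary data yet"
    if error_rate > threshold:
        return f"roll back: error rate {error_rate:.2%} exceeds {threshold:.2%}"
    return f"promote: error rate {error_rate:.2%} is within {threshold:.2%}"
```

In practice you would call something like `print(recommend(fetch_error_rate(base_url, query)))` a few minutes after the deploy, where `base_url` and `query` are your own Prometheus address and error-ratio query. Treating missing data as "hold" rather than "promote" keeps a silent canary from looking like a healthy one.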

If you want AI-powered canary analysis that automatically promotes or rolls back, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com