Building a Production-Grade Observability Platform with LGTM Stack, DORA Metrics & SLOs


GitHub Repository: https://github.com/AirFluke/meetmind-observability
One command to deploy: docker compose up -d


Introduction

Modern software teams don't just need to know when something is down — they need to understand why it broke, how long users were affected, how fast they recovered, and whether their engineering practices are improving over time.

This is the gap between basic monitoring and true observability.

For Stage 6 of the HNG DevOps track, Team MeetMind built a production-grade observability and reliability platform from scratch on the LGTM stack (Loki, Grafana, Tempo, and a metrics backend, which here is Prometheus rather than Mimir), alongside DORA metrics, SLI/SLO/Error Budget frameworks, and a fully automated alerting pipeline routing to Slack.

Everything is infrastructure as code. No manual UI configuration. One command brings the entire stack up.


Why LGTM Over Managed Alternatives?

The observability market offers managed alternatives — Datadog, New Relic, Grafana Cloud. So why self-host the LGTM stack?

Cost at scale. Managed platforms charge per host, per metric, per log line. At scale this becomes a significant infrastructure cost. The LGTM stack runs on a single server with no per-metric pricing.

Data sovereignty. Logs contain sensitive data — request bodies, auth tokens, PII. Shipping these to a third-party SaaS introduces compliance risk. Self-hosted Loki keeps logs within your own infrastructure.

No vendor lock-in. Prometheus exposition format and OpenTelemetry are open standards. Every instrumented service, every dashboard, every alert rule is portable. Switching providers means changing an endpoint URL, not rewriting your entire observability layer.

Full control over retention. We configured 30-day retention for both metrics and logs at no additional cost.

Learning depth. Operating the stack yourself forces genuine understanding of how metrics collection, log aggregation, and distributed tracing work — knowledge that transfers regardless of which tools your next employer uses.


Architecture Overview

The platform runs as a Docker Compose stack with nine services, all with automatic restart policies.

| Component | Role | Port |
|---|---|---|
| Prometheus | Metrics collection and storage | 9090 |
| Loki | Log aggregation | 3100 |
| Tempo | Distributed trace storage | 3200 |
| Grafana | Unified observability frontend | 3000 |
| Alertmanager | Alert routing to Slack | 9093 |
| Node Exporter | System metrics (CPU, RAM, disk, network) | 9100 |
| Blackbox Exporter | HTTP/SSL probing | 9115 |
| Pushgateway | Receives DORA metrics from GitHub Actions | 9091 |
| OTel Collector | Receives and routes traces and logs | 4317/4318 inside the Compose network (published on the host as 4319/4320) |

Data flow:

  • Node Exporter and Blackbox Exporter expose metrics → Prometheus scrapes every 15 seconds
  • GitHub Actions pushes deployment metrics → Pushgateway → Prometheus
  • Applications send traces via OpenTelemetry → OTel Collector → Tempo
  • Applications send logs via OpenTelemetry → OTel Collector → Loki
  • Grafana sits on top of all three — Prometheus, Loki, Tempo — enabling correlated drill-down from a single dashboard

📸 [Screenshot: docker compose ps showing all 9 services Up]


Part 1: Deploying the Full LGTM Stack

Docker Compose — the complete stack

# docker-compose.yml
version: "3.8"

networks:
  observability:
    driver: bridge

volumes:
  prometheus_data:
  loki_data:
  tempo_data:
  grafana_data:

services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    restart: unless-stopped
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
      - "--web.enable-remote-write-receiver"
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alerts:/etc/prometheus/alerts:ro
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - observability

  loki:
    image: grafana/loki:2.9.7
    container_name: loki
    restart: unless-stopped
    command: -config.file=/etc/loki/loki-config.yaml
    volumes:
      - ./config/loki-config.yaml:/etc/loki/loki-config.yaml:ro
      - loki_data:/loki
    ports:
      - "3100:3100"
    networks:
      - observability

  tempo:
    image: grafana/tempo:2.4.1
    container_name: tempo
    restart: unless-stopped
    command: -config.file=/etc/tempo/tempo.yaml
    volumes:
      - ./config/tempo.yaml:/etc/tempo/tempo.yaml:ro
      - tempo_data:/var/tempo
    ports:
      - "3200:3200"
      - "4317:4317"
      - "4318:4318"
    networks:
      - observability

  grafana:
    image: grafana/grafana:10.4.2
    container_name: grafana
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3000:3000"
    networks:
      - observability

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - ./config/slack.tmpl:/etc/alertmanager/slack.tmpl:ro
    ports:
      - "9093:9093"
    networks:
      - observability

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    # Assumed addition: these flags are not in the original excerpt. Without
    # them the exporter reads the container's own /proc, /sys and root
    # filesystem instead of the host paths mounted below.
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    ports:
      - "9100:9100"
    networks:
      - observability

  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    volumes:
      - ./config/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    ports:
      - "9115:9115"
    networks:
      - observability

  pushgateway:
    image: prom/pushgateway:v1.7.0
    container_name: pushgateway
    restart: unless-stopped
    ports:
      - "9091:9091"
    networks:
      - observability

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.98.0
    container_name: otel-collector
    restart: unless-stopped
    command: ["--config=/etc/otel/otel-collector.yaml"]
    volumes:
      - ./config/otel-collector.yaml:/etc/otel/otel-collector.yaml:ro
    ports:
      # Host 4319/4320 map to the collector's OTLP 4317/4318,
      # because Tempo already publishes 4317/4318 on the host.
      - "4319:4317"
      - "4320:4318"
      - "8888:8888"
    networks:
      - observability

One command to bring everything up

docker compose up -d

Infrastructure as Code — non-negotiable

Every configuration file is version-controlled. Nothing is configured through a UI:

config/
├── prometheus.yml # Scrape configs + rule file references
├── alertmanager.yml # Route trees + inhibition rules
├── slack.tmpl # Slack notification template
├── loki-config.yaml # Log ingestion + 30d retention
├── tempo.yaml # Trace storage + 30d retention
├── otel-collector.yaml # Trace and log pipeline
└── blackbox.yml # HTTP + SSL probe modules

alerts/
├── infrastructure.yml # CPU, memory, disk, host down
├── slo-burnrate.yml # Multi-window burn rate alerts
└── cicd.yml # DORA threshold alerts

grafana/
├── provisioning/ # Datasource + dashboard discovery
└── dashboards/ # 5 JSON dashboards

Prometheus scrape configuration

# config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Assumed addition: the original excerpt does not show this block, but
# Prometheus needs it to forward firing alerts to Alertmanager.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - /etc/prometheus/alerts/infrastructure.yml
  - /etc/prometheus/alerts/slo-burnrate.yml
  - /etc/prometheus/alerts/cicd.yml

scrape_configs:
  - job_name: node-exporter
    scrape_interval: 15s
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://grafana:3000
          - http://prometheus:9090/-/healthy
          - http://loki:3100/ready
    relabel_configs:
      # Pass the listed URL to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself rather than the target
      - target_label: __address__
        replacement: blackbox-exporter:9115

  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ["pushgateway:9091"]
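The three relabel steps in the blackbox-http job are easy to misread, so here is a small Python sketch of what they do to a single target. It is only a simulation for illustration (the function name `blackbox_scrape_url` is hypothetical); Prometheus performs these steps internally.

```python
from urllib.parse import urlencode


def blackbox_scrape_url(target: str,
                        module: str = "http_2xx",
                        exporter: str = "blackbox-exporter:9115") -> tuple[str, str]:
    """Mimic the relabel_configs of the blackbox-http job for one target.

    Returns (scrape_url, instance_label).
    """
    labels = {"__address__": target, "__param_module": module}
    # Step 1: copy the listed address into the ?target= query parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the probed URL as the instance label shown in dashboards.
    labels["instance"] = labels["__param_target"]
    # Step 3: send the actual scrape to the exporter, not the target.
    labels["__address__"] = exporter

    query = urlencode({"module": labels["__param_module"],
                       "target": labels["__param_target"]})
    return f"http://{labels['__address__']}/probe?{query}", labels["instance"]


url, instance = blackbox_scrape_url("http://loki:3100/ready")
print(url)       # the request Prometheus sends to the exporter
print(instance)  # the instance label on the resulting probe_* series
```

The practical consequence is that `probe_success` series carry the probed URL as their `instance` label, not a host:port pair.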

Retention periods:

  • Prometheus metrics: 30 days (--storage.tsdb.retention.time=30d)
  • Loki logs: 30 days (retention_period: 30d in loki-config.yaml)
  • Tempo traces: 30 days (block_retention: 720h in tempo.yaml)

📸 [Screenshot: Prometheus targets page showing all scrapers green]


Part 2: The Four Golden Signals as SLIs

Before writing a single PromQL query or building any dashboard, we defined what reliability means for MeetMind using Google's Four Golden Signals framework.

Why Four Golden Signals beat CPU/RAM monitoring

Traditional monitoring asks "is the server healthy?" The Four Golden Signals ask "is the user experiencing a healthy service?"

A server can have 10% CPU and still serve every request with 5-second latency. CPU monitoring shows green. The Four Golden Signals show red. That's the difference.

Signal 1 — Latency

How long does it take to serve a request? We distinguish successful from error latency — a fast error is not a success.

# p95 latency for successful requests
histogram_quantile(0.95,
 sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le, job)
)

# p95 latency for error requests (errors are often faster — fail fast)
histogram_quantile(0.95,
 sum(rate(http_request_duration_seconds_bucket{status=~"5.."}[5m])) by (le, job)
)
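To make the p95 queries less of a black box, the sketch below approximates how a quantile is estimated from cumulative histogram buckets: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplified Python rendering for illustration, not Prometheus' actual implementation, and the bucket values are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("the last bucket must be +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls beyond the last finite bucket: report that bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            # Linear interpolation within the bucket that holds the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical per-second rates for le = 0.1, 0.25, 0.5, 1.0, +Inf
sample = [(0.1, 60), (0.25, 85), (0.5, 96), (1.0, 99), (math.inf, 100)]
print(round(histogram_quantile(0.95, sample), 3))
```

Because the result is interpolated, its accuracy depends on bucket boundaries; a 500ms SLO is easiest to evaluate when a bucket edge sits at 0.5.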

Signal 2 — Traffic

How much demand is the system handling?

# Requests per second
sum(rate(http_requests_total[1m])) by (job)

Signal 3 — Errors

Rate of failed requests — explicit 5xx, implicit wrong content, policy failures.

# Error rate as a ratio (0 = perfect, 1 = everything failing)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)

Signal 4 — Saturation

How "full" is the service? We track CPU, memory, and disk.

# Memory saturation
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# CPU saturation
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Disk saturation
1 - (
 node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}
 / node_filesystem_size_bytes{mountpoint="/", fstype!="tmpfs"}
)

These PromQL expressions, grouped under the four signals, become our SLIs: the measurements we track.


Part 3: SLOs and Error Budgets

The philosophy

An SLI is a measurement.
An SLO is a target for that measurement.
An Error Budget is the allowable gap between perfect and the SLO target.

This framework changes how engineering teams make decisions. Instead of arguing about whether a deployment is "safe enough", the question becomes: "Do we have enough error budget to absorb the risk of this deployment?"

It converts a subjective conversation into an objective one.

Our SLO targets

| SLO | Target | Window | Error Budget |
|---|---|---|---|
| Availability | 99.5% of HTTP probes return 2xx | 30 days | 216 minutes |
| Error rate | 99% of requests succeed | 30 days | 432 minutes |
| Latency | p95 < 500ms | Rolling 5m | Alert-only |

Why 99.5% availability?
This gives us 216 minutes per month — enough for one planned maintenance window without exhausting the budget. A stricter 99.9% would leave only 43 minutes, making any deployment risky.
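The budget figures follow directly from the SLO target and the window length. A minimal Python sketch of that arithmetic (the function name is illustrative):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes for a time-based SLO over the window."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return (1 - slo) * window_minutes


for target in (0.995, 0.99, 0.999):
    print(f"{target:.1%} -> {error_budget_minutes(target):.1f} minutes")
```

For the 30-day window this reproduces the 216, 432 and roughly 43 minute figures quoted in this section.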

Why a 99% success rate?
One percent failure tolerance allows for transient errors during rolling deployments. Stricter targets require canary deployment infrastructure before they're meaningful.

Why 500ms p95 latency?
A common target for interactive APIs; beyond roughly this threshold, users start to notice the delay. We chose p95 rather than p99 because optimising for the 99th percentile often requires disproportionate infrastructure investment.

Recording rules for SLIs

# alerts/slo-burnrate.yml
groups:
  - name: slo.recording_rules
    interval: 30s
    rules:
      - record: slo:availability:ratio_rate5m
        expr: avg_over_time(probe_success[5m])

      - record: slo:availability:ratio_rate1h
        expr: avg_over_time(probe_success[1h])

      - record: slo:availability:ratio_rate6h
        expr: avg_over_time(probe_success[6h])

      - record: slo:availability:ratio_rate30d
        expr: avg_over_time(probe_success[30d])

      # Burn rate = how fast we're consuming error budget
      # Error budget = 1 - 0.995 = 0.005
      - record: slo:availability:burn_rate1h
        expr: (1 - slo:availability:ratio_rate1h) / 0.005

      - record: slo:availability:burn_rate6h
        expr: (1 - slo:availability:ratio_rate6h) / 0.005
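The burn-rate rules divide the observed failure ratio by the allowed failure ratio. The Python sketch below mirrors that calculation and adds one derived quantity, the time until the budget runs out at a constant rate; function names are illustrative.

```python
def burn_rate(availability: float, slo: float = 0.995) -> float:
    """Observed failure ratio divided by the allowed failure ratio."""
    return (1 - availability) / (1 - slo)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


rate = burn_rate(0.928)  # 7.2% of probes failing over the lookback window
print(round(rate, 1), round(hours_to_exhaustion(rate), 1))
```

A burn rate of 1 spends the budget exactly over the 30-day window; anything above 1 exhausts it early.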

Error Budget Policy

Budget remaining > 50% → Deploy freely, feature work continues
Budget remaining 25-50% → Investigate incidents, no major changes
Budget remaining < 25% → Reliability sprint, senior review on all deploys
Budget remaining 0% → Feature freeze until budget recovers

Who owns the freeze decision? Engineering lead.
Review cadence? First Monday of each month.
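The policy can be expressed as a small lookup. This Python sketch assumes the 216-minute availability budget and treats the 25% and 50% boundaries as belonging to the middle tier, which the policy text leaves unspecified.

```python
def budget_remaining(bad_minutes: float, budget_minutes: float = 216.0) -> float:
    """Fraction of the error budget still unspent (0.0 to 1.0)."""
    return max(0.0, 1 - bad_minutes / budget_minutes)


def deploy_policy(remaining: float) -> str:
    """Map the remaining budget fraction to the policy tier."""
    if remaining <= 0:
        return "Feature freeze until budget recovers"
    if remaining < 0.25:
        return "Reliability sprint, senior review on all deploys"
    if remaining <= 0.50:
        return "Investigate incidents, no major changes"
    return "Deploy freely, feature work continues"


for minutes in (60, 130, 170, 216):
    left = budget_remaining(minutes)
    print(f"{minutes} bad minutes -> {left:.0%} left -> {deploy_policy(left)}")
```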

📸 [Screenshot: SLO & Error Budget dashboard showing gauges and burn rate]


Part 4: DORA Metrics and CI/CD Observability

Why DORA metrics connect to business outcomes

DORA metrics answer: "Is our team getting better or worse at delivering software safely?"

| Metric | Business impact |
|---|---|
| Deployment Frequency | How often value reaches users |
| Lead Time for Changes | How quickly a bug fix ships |
| Change Failure Rate | Cost of broken deployments |
| Mean Time to Restore | Duration of user impact during incidents |

DORA benchmarks

| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deploy frequency | Multiple/day | Weekly | Monthly | < Monthly |
| Lead time | < 1 hour | < 1 day | 1d–1w | > 1 week |
| CFR | < 5% | 5–10% | 10–15% | > 15% |
| MTTR | < 1 hour | < 1 day | 1d–1w | > 1 week |
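A classification badge is just a threshold lookup against this table. The Python sketch below shows one way to do it for CFR and for the duration-based metrics; how exact boundary values (for example, a CFR of exactly 10%) are assigned is an assumption, since the table does not say.

```python
from datetime import timedelta


def classify_cfr(ratio: float) -> str:
    """Classify Change Failure Rate given as a fraction (0.12 = 12%)."""
    if ratio < 0.05:
        return "Elite"
    if ratio <= 0.10:
        return "High"
    if ratio <= 0.15:
        return "Medium"
    return "Low"


def classify_duration(duration: timedelta) -> str:
    """Classify lead time or MTTR using the duration thresholds."""
    if duration < timedelta(hours=1):
        return "Elite"
    if duration < timedelta(days=1):
        return "High"
    if duration <= timedelta(weeks=1):
        return "Medium"
    return "Low"


print(classify_cfr(0.12), classify_duration(timedelta(minutes=47)))
```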

GitHub Actions pushing DORA metrics to Pushgateway

# .github/workflows/deploy.yml
jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      # Assumption: the excerpt does not show where PUSHGATEWAY_URL is set;
      # a repository secret of the same name is used here for illustration.
      PUSHGATEWAY_URL: ${{ secrets.PUSHGATEWAY_URL }}
    steps:
      - uses: actions/checkout@v4

      - name: Record deploy start time
        id: timing
        run: echo "start_ts=$(date +%s)" >> "$GITHUB_OUTPUT"

      - name: Build and deploy
        run: |
          echo "Your actual build and deploy steps here"

      - name: Push DORA metrics on success
        if: success()
        run: |
          # Time since the job started, used as a proxy for lead time
          LEAD_TIME=$(( $(date +%s) - ${{ steps.timing.outputs.start_ts }} ))
          WORKFLOW="${{ github.workflow }}"

          # Deployment counter
          # Note: Pushgateway stores the last pushed value per series;
          # it does not add successive pushes together.
          cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
          deployment_total{status="success",workflow="${WORKFLOW}"} 1
          EOF

          # Lead time
          cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
          deployment_lead_time_seconds{workflow="${WORKFLOW}"} ${LEAD_TIME}
          EOF

      - name: Push DORA metrics on failure
        if: failure()
        run: |
          WORKFLOW="${{ github.workflow }}"
          cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/github_actions"
          deployment_total{status="failure",workflow="${WORKFLOW}"} 1
          EOF

DORA recording rules in Prometheus

groups:
  - name: cicd.recording_rules
    rules:
      # Deployment frequency
      - record: dora:deployment_frequency:rate24h
        expr: sum(increase(deployment_total[24h])) by (workflow)

      # Change Failure Rate = failed / total over 7 days
      - record: dora:change_failure_rate:ratio7d
        expr: |
          sum(increase(deployment_total{status="failure"}[7d])) by (workflow)
          /
          sum(increase(deployment_total[7d])) by (workflow)

      # Mean Time to Restore
      - record: dora:mttr:avg7d
        expr: avg_over_time(deployment_restore_time_seconds[7d])

Toil identified and automated

Toil 1 — Manual alert acknowledgement. Engineers read a Slack alert, open a browser, navigate to Grafana, search for the relevant dashboard. Automation: every alert payload includes a direct link to the exact dashboard. Saves 2–3 minutes per alert.

Toil 2 — Certificate renewal reminders. SSL expiry tracked via calendar reminders. Automation: Blackbox Exporter monitors SSL expiry continuously. SSLCertExpiringSoon alert fires 14 days before expiry automatically.

📸 [Screenshot: DORA metrics dashboard with classification badges]


Part 5: Five Grafana Dashboards — All Provisioned as Code

All dashboards are provisioned from JSON files. The Grafana UI was never used to create or modify any panel.

Grafana provisioning configuration

# grafana/provisioning/datasources/datasources.yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      # This is the key config for trace drill-down.
      # "$$" escapes "$" so the provisioning loader does not expand the
      # Grafana template variable as an environment variable.
      derivedFields:
        - name: TraceID
          matcherRegex: 'traceID=(\w+)'
          url: "$${__value.raw}"
          datasourceUid: tempo
          urlDisplayLabel: "Open in Tempo"
        - name: TraceID_json
          matcherRegex: '"traceId":"(\w+)"'
          url: "$${__value.raw}"
          datasourceUid: tempo
          urlDisplayLabel: "Open trace in Tempo"

  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        filterByTraceID: true
        customQuery: true
        query: '{service_name="$${__span.tags.service.name}"}|="$${__trace.traceId}"'
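The two `matcherRegex` patterns decide which log lines get a clickable trace link, so it is worth checking them against real log samples. The Python sketch below applies the same patterns to invented log lines; Grafana evaluates these with its own regex engine, so treat this as an approximation.

```python
import re
from typing import Optional

# Same patterns as the derivedFields entries
PLAIN_PATTERN = re.compile(r'traceID=(\w+)')
JSON_PATTERN = re.compile(r'"traceId":"(\w+)"')


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the first capture group that matches, or None."""
    for pattern in (PLAIN_PATTERN, JSON_PATTERN):
        match = pattern.search(log_line)
        if match:
            return match.group(1)
    return None


# Hypothetical log lines
print(extract_trace_id('level=error msg="db timeout" traceID=4bf92f3577b34da6'))
print(extract_trace_id('{"level":"error","traceId":"4bf92f3577b34da6"}'))
print(extract_trace_id('{"level":"error", "traceId": "4bf92f3577b34da6"}'))
```

The third line returns `None`: the JSON pattern expects no whitespace after the colon, so a logger that pretty-prints JSON would need a looser regex.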

Dashboard 1 — Node Exporter

CPU utilisation total and per-core, memory used/cached/available, disk I/O, network I/O, and load averages at 1/5/15 minutes. Gives instant visibility into whether resource saturation is causing service degradation.

📸 [Screenshot: Node Exporter dashboard with live CPU and memory data]

Dashboard 2 — Blackbox Exporter

External probing: uptime/downtime timeline, HTTP response time, SSL certificate expiry countdown, probe success rate. This dashboard answers "what is the user experiencing?" rather than "what is the server doing?" — a critical distinction.

📸 [Screenshot: Blackbox Exporter dashboard showing probe results]

Dashboard 3 — DORA Metrics

Deployment frequency trend, lead time distribution, CFR raw count and rolling percentage, MTTR with DORA benchmark classification displayed prominently. Classification updates automatically as metrics change.

Dashboard 4 — SLO & Error Budget

SLI vs SLO gauges, error budget remaining as a bar gauge coloured by urgency, burn rate time series with fast/slow burn thresholds marked, SLO compliance history over 7 and 30 day windows.

📸 [Screenshot: SLO dashboard with error budget gauge]

Dashboard 5 — Unified Observability (the most important)

This is the dashboard that makes the entire stack worth building.

A user sees a spike in the error rate panel → clicks through to Loki → sees error logs from that exact time window → clicks the trace ID link → Tempo opens the waterfall → identifies exactly which service, endpoint, and span caused the failure.

This drill-down — metric spike → correlated logs → causing trace — is what separates observability from monitoring.

Monitoring: "Something is wrong"
Observability: "Here is exactly why, where, and when"

📸 [Screenshot: Unified dashboard showing error rate spike]

📸 [Screenshot: Loki logs panel with clickable trace IDs]


Part 6: The Alerting System

All alert rules are version-controlled

Zero alert rules live in Grafana. Every rule is in a .yml file under alerts/.

Infrastructure alerts

# alerts/infrastructure.yml
groups:
  - name: infrastructure.rules
    rules:
      # Recording rules: pre-compute SLIs
      - record: sli:node_cpu_saturation
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      - record: sli:node_memory_saturation
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

      # CPU alerts
      - alert: HighCPUWarning
        expr: sli:node_cpu_saturation > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU is {{ $value | humanizePercentage }} (threshold: 80%)"
          dashboard_url: "http://YOUR_SERVER_IP:3000/d/node-exporter"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/high-cpu.md"

      - alert: HighCPUCritical
        expr: sli:node_cpu_saturation > 0.90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU on {{ $labels.instance }}"
          description: "CPU is {{ $value | humanizePercentage }} for 10+ minutes"
          dashboard_url: "http://YOUR_SERVER_IP:3000/d/node-exporter"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/high-cpu.md"

      # Host down: Blackbox probe fails for 2 minutes
      - alert: HostDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is down"
          description: "Blackbox probe failed for 2+ consecutive minutes"
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/host-down.md"

Burn rate alerting — how it reduces alert fatigue

Traditional threshold alerting fires whenever a metric crosses a line. This produces alert storms — dozens of notifications for a single incident. Teams learn to ignore them.

Burn rate alerting answers a different question: "At this rate of failure, how long until our error budget is exhausted?"

Two alerts replace an entire category of noise:

# alerts/slo-burnrate.yml
  - name: slo.alerts
    rules:
      # Fast burn: act immediately
      # 14.4x means 2% of monthly budget gone in 1 hour (14.4 * 1h / 720h)
      - alert: SLOAvailabilityFastBurn
        expr: slo:availability:burn_rate1h > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn: act immediately"
          description: >
            Burn rate is {{ $value | humanize }}x. At this rate,
            2% of the 30-day budget will be consumed in 1 hour.
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/slo-fast-burn.md"

      # Slow burn: investigate before it escalates
      # 5x means about 4.2% of monthly budget gone in 6 hours (5 * 6h / 720h);
      # a 5% target over 6 hours would correspond to a 6x threshold
      - alert: SLOAvailabilitySlowBurn
        expr: slo:availability:burn_rate6h > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn: investigate soon"
          description: >
            Burn rate is {{ $value | humanize }}x over 6h. At the 5x threshold,
            about 4.2% of the 30-day budget is consumed every 6 hours.
          runbook_url: "https://github.com/AirFluke/meetmind-observability/blob/main/runbooks/slo-slow-burn.md"
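The thresholds are tied to how much of the 30-day budget a sustained burn consumes over the alert window: fraction = burn rate × window / period. The Python sketch below works through that relationship in both directions (function names are illustrative).

```python
def budget_consumed(burn_rate: float, alert_window_hours: float,
                    slo_window_days: int = 30) -> float:
    """Fraction of the SLO window's budget spent at a constant burn rate."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)


def threshold_for(fraction: float, alert_window_hours: float,
                  slo_window_days: int = 30) -> float:
    """Burn rate that spends the given budget fraction within the window."""
    return fraction * slo_window_days * 24 / alert_window_hours


print(f"{budget_consumed(14.4, 1):.2%}")  # fast-burn threshold over 1 hour
print(f"{budget_consumed(5, 6):.2%}")     # slow-burn threshold over 6 hours
print(threshold_for(0.05, 6))             # rate needed to spend 5% in 6 hours
```

With a 30-day window, 14.4x over one hour is exactly 2% of the budget, while 5x over six hours is about 4.17%; spending 5% in six hours takes a 6x burn rate.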

Alertmanager routing and inhibition

# config/alertmanager.yml
route:
  receiver: slack-default
  group_by: [alertname, severity, instance]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: slack-critical
      group_wait: 10s
      repeat_interval: 4h

inhibit_rules:
  # When host is completely down, suppress CPU/memory/latency noise
  - source_match:
      alertname: HostDown
    target_match_re:
      alertname: "HighCPU.*|HighMemory.*|HighLatency.*|DiskSpace.*"
    equal: [instance]

  # Critical suppresses warning for same alert on same host
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]
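Inhibition only applies when the source matches, the target matches, and every label listed under `equal` has the same value on both alerts. The Python sketch below simulates that matching for the two rules above so the conditions can be checked against example alerts. It is a simplified model, not Alertmanager's code, and the label values are invented.

```python
import re

RULES = [
    {   # HostDown suppresses resource alerts on the same instance
        "source_match": {"alertname": "HostDown"},
        "target_match_re": {"alertname": "HighCPU.*|HighMemory.*|HighLatency.*|DiskSpace.*"},
        "equal": ["instance"],
    },
    {   # Critical suppresses warning for the same alertname and instance
        "source_match": {"severity": "critical"},
        "target_match": {"severity": "warning"},
        "equal": ["alertname", "instance"],
    },
]


def inhibits(rule: dict, source: dict, target: dict) -> bool:
    """True if a firing `source` alert would mute `target` under `rule`."""
    source_ok = all(source.get(k) == v for k, v in rule.get("source_match", {}).items())
    target_ok = all(target.get(k) == v for k, v in rule.get("target_match", {}).items())
    # Regex matchers are anchored, so the whole label value must match.
    target_re_ok = all(
        re.fullmatch(pattern, target.get(k, "")) is not None
        for k, pattern in rule.get("target_match_re", {}).items()
    )
    # A label that is missing on both sides counts as equal.
    equal_ok = all(source.get(name, "") == target.get(name, "")
                   for name in rule.get("equal", []))
    return source_ok and target_ok and target_re_ok and equal_ok


host_down = {"alertname": "HostDown", "severity": "critical", "instance": "web-1"}
cpu_warn = {"alertname": "HighCPUWarning", "severity": "warning", "instance": "web-1"}
cpu_crit = {"alertname": "HighCPUCritical", "severity": "critical", "instance": "web-1"}

print(inhibits(RULES[0], host_down, cpu_warn))  # same instance
print(inhibits(RULES[1], cpu_crit, cpu_warn))   # alertnames differ
```

Two things follow from the `equal` lists: the first rule only helps when HostDown and the resource alert carry the same `instance` value, and the second only mutes a warning that shares its `alertname` with the critical alert.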

Structured Slack payload — plain text is not acceptable

Every alert in #all-hng-alerts includes alert name, severity, host, metric value, Grafana link, and runbook link.

# config/slack.tmpl
{{ define "slack.title" -}}
[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}
{{- end }}

{{ define "slack.body" -}}
{{ range .Alerts }}
*Alert:* {{ .Labels.alertname }}
*Severity:* {{ .Labels.severity | toUpper }}
*Status:* {{ if eq .Status "resolved" }}✅ RESOLVED{{ else }}🔥 FIRING{{ end }}
*Host:* {{ .Labels.instance }}
*Summary:* {{ .Annotations.summary }}

*Links:*
• <{{ .Annotations.dashboard_url }}|📊 Grafana Dashboard>
• <{{ .Annotations.runbook_url }}|📖 Runbook>

*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05 UTC" }}
{{ end }}
{{- end }}

📸 [Screenshot: Slack showing firing alert with full structured payload]

📸 [Screenshot: Slack showing RESOLVED alert]


Part 7: Runbooks and Incident Management

A runbook for every alert

Every alert links directly to its runbook. An engineer woken at 3am should be able to follow it to resolution without searching.

Each runbook answers six questions:

# Runbook: High CPU Usage

## What is this alert?
HighCPUWarning fires when CPU exceeds 80% for 5+ minutes.

## Likely cause
1. Traffic spike
2. Runaway process
3. Post-deployment regression

## First 3 investigation steps
1. Check running processes:

    top -bn1 | head -20
    docker stats --no-stream

2. Correlate with traffic on Unified Observability dashboard
3. Check recent deployments in GitHub Actions

## Resolution
- Runaway process: kill -9 <PID>
- Traffic spike: scale horizontally
- Deployment regression: roll back

## Roll back when?
If CPU spike started within 30 minutes of a deployment
and correlates with increased error rate.

## Escalation
Senior engineer if unresolved after 20 minutes.

Blameless Post-Incident Review

We documented a simulated incident where a missing environment variable caused 35% of requests to return 503 for 47 minutes.

Timeline:

| Time | Event |
|---|---|
| 14:18 | Deployment triggered |
| 14:23 | 503 responses begin |
| 14:29 | SLOAvailabilityFastBurn fires (6-min detection lag) |
| 14:36 | Trace ID in Loki → Tempo reveals config read failure |
| 14:40 | Root cause identified: missing DATABASE_URL env var |
| 14:45 | Rollback initiated |
| 15:10 | Error rate returns to baseline |

Root cause: New environment variable added to code but not to docker-compose.yml.

Detection gap: 6-minute lag between incident start and alert firing. Action item: reduce fast-burn for: clause from 2m to 1m.
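The headline numbers in this review come straight from the timeline. A short Python sketch of that arithmetic, using the timestamps above (the dictionary keys are illustrative labels):

```python
from datetime import datetime

TIMELINE = {
    "deployed": "14:18",
    "impact_start": "14:23",
    "alert_fired": "14:29",
    "root_cause": "14:40",
    "rollback": "14:45",
    "recovered": "15:10",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


detection_lag = minutes_between(TIMELINE["impact_start"], TIMELINE["alert_fired"])
time_to_restore = minutes_between(TIMELINE["impact_start"], TIMELINE["recovered"])
rollback_duration = minutes_between(TIMELINE["rollback"], TIMELINE["recovered"])
print(detection_lag, time_to_restore, rollback_duration)
```

Detection took 6 of the 47 minutes of user impact; the 25 minutes between starting the rollback and returning to baseline were the largest single segment.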

Action items:

| Action | Owner | Due |
|---|---|---|
| Add post-deploy smoke test | DevOps | 3 days |
| Add env var validation to entrypoint | App dev | 5 days |
| Reduce fast-burn for: clause to 1m | DevOps | 1 day |

This review is blameless — we focus on systems and processes, not individuals.


Part 8: Game Day Results

Scenario 1 — Deployment Failure

Added exit 1 to the GitHub Actions workflow and pushed. The workflow failed and pushed deployment_total{status="failure"} to the Pushgateway. CICDDeploymentFailed fired in Slack within 2 minutes. DORA dashboard showed CFR increase. Immediately reverted.

📸 [Screenshot: GitHub Actions showing red failed run]

📸 [Screenshot: CICDDeploymentFailed in Slack]

Scenario 2 — Latency Injection

Injected 600ms network latency:

sudo tc qdisc add dev ens5 root netem delay 600ms

HighLatencyWarning fired confirming the alerting pipeline for latency SLO breaches works end-to-end.

# Remove latency
sudo tc qdisc del dev ens5 root

RESOLVED message confirmed recovery detection works.

📸 [Screenshot: Unified dashboard showing latency spike]

📸 [Screenshot: HighLatencyWarning in Slack]

📸 [Screenshot: RESOLVED in Slack after tc removed]

Scenario 3 — Resource Pressure

Used stress-ng to drive CPU above 90%:

stress-ng --cpu 0 --cpu-method matrixprod --timeout 600s &

What we observed:

  • HighCPUWarning entered pending state after CPU sustained above 80%
  • After 5 minutes → HighCPUWarning turned firing in Prometheus
  • Alert arrived in Slack with full structured payload
  • HighCPUCritical entered pending (needs 10min sustained above 90%)
  • After killing stress: both alerts RESOLVED in Slack

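This confirmed the full warning → critical → recovery sequence. It is not evidence of inhibition, however: the warning notification did reach Slack, and the severity rule in alertmanager.yml only suppresses a warning that shares its alertname and instance with a critical alert, which HighCPUWarning and HighCPUCritical do not.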

pkill stress-ng

📸 [Screenshot: Prometheus alerts page showing Warning firing]

📸 [Screenshot: Prometheus alerts page showing Critical pending]

📸 [Screenshot: Node Exporter dashboard with CPU spike at 92%]

📸 [Screenshot: HighCPUWarning in Slack]

📸 [Screenshot: RESOLVED in Slack]


Key Learnings

1. Observability is not monitoring.
Monitoring tells you something is wrong. Observability tells you why, where, and when — without needing to SSH into a server.

2. SLOs make reliability decisions objective.
"Is this deployment safe?" is subjective. "Do we have 100 minutes of error budget remaining?" is objective. SLOs turn reliability from a conversation into a measurement.

3. Burn rate alerting reduces alert fatigue.
Two burn rate alerts replaced what would have been dozens of threshold alerts during our Game Day scenarios. Engineers respond to meaningful signals, not noise.

4. DORA metrics connect engineering to business.
High MTTR isn't just a technical problem — it's lost revenue per minute. Low deployment frequency isn't just slow — it's delayed value delivery. DORA makes this explicit.

5. Everything as code is non-negotiable.
Every dashboard, alert rule, and config that lives only in a UI is technical debt. When the server dies, you want to run docker compose up -d and have everything back — not spend three hours recreating dashboards from memory.


Conclusion

The MeetMind Observability Platform demonstrates that production-grade observability is achievable without managed services. The LGTM stack provides the full observability triad — metrics, logs, and traces — with correlation between all three. SLOs convert vague reliability goals into measurable targets. DORA metrics connect daily engineering decisions to business outcomes. Burn rate alerting replaces alert storms with two meaningful signals.

The entire platform deploys with one command. Every component is version-controlled. Every alert links to a runbook. Every metric spike links to correlated logs and traces.

GitHub Repository: https://github.com/AirFluke/meetmind-observability


Built by Team MeetMind for HNG DevOps Track Stage 6