
URL: https://dev.to/samson_tanimawo/the-on-call-handoff-that-prevents-dropped-incidents-1eag

The On-Call Handoff That Prevents Dropped Incidents


The Monday Morning Disaster

Every Monday, the same story: the incoming on-call engineer has no idea what happened over the weekend. The outgoing engineer left a cryptic Slack message at 11pm and went to bed.

We lost 2 hours every Monday rebuilding context.

The Structured Handoff

We built a handoff template that takes 15 minutes to write and saves hours of confusion:

# On-Call Handoff: [DATE] → [DATE]
## Outgoing: @engineer_a | Incoming: @engineer_b

### Active Issues
| Issue | Status | Next Step | ETA |
|-------|--------|-----------|-----|
| DB replication lag | Monitoring | Auto-resolves if < 5s | Check at noon |
| Cert expiry api.prod | Fix scheduled | Deploy cert-bot PR #234 | Tuesday AM |

### Incidents This Shift
1. **[P2] Payment timeout spike**, 2024-03-15 02:30 UTC
   - Resolved: Increased connection pool from 20→50
   - Post-mortem: Scheduled for Wednesday
   - Lingering risk: Pool size is a band-aid, need connection pooler

### Upcoming Risks
- Major deploy of auth-service v3 on Tuesday
- Black Friday load test on Thursday
- AWS maintenance window Friday 2-6am UTC

### Helpful Context
- The cache-service has been flaky — restart fixes it (known bug, JIRA-456)
- New on-call runbook for search-service is at [link]
- PagerDuty schedule was updated — check your shifts

### Metrics to Watch
- DB replication lag: should be < 1s (currently 0.8s)
- Payment success rate: should be > 99.8% (currently 99.7%)
- API error rate: baseline is 0.05% (currently 0.04%)
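The "Metrics to Watch" section lends itself to a mechanical check, so the incoming engineer sees at a glance which numbers are already outside their target. The following is a minimal sketch, not part of the original tooling: the `MetricCheck` class and `metrics_needing_attention` function are hypothetical names, and only the two metrics with an explicit "should be" target in the template are included.

```python
from dataclasses import dataclass


@dataclass
class MetricCheck:
    """One row of 'Metrics to Watch' (hypothetical structure)."""
    name: str
    current: float
    threshold: float
    higher_is_better: bool  # True for success rates, False for lag

    def is_healthy(self) -> bool:
        # A metric is healthy when it sits on the right side of its target.
        if self.higher_is_better:
            return self.current > self.threshold
        return self.current < self.threshold


def metrics_needing_attention(checks):
    """Return the names of metrics that are outside their target."""
    return [c.name for c in checks if not c.is_healthy()]


# Values copied from the example template above.
checks = [
    MetricCheck("DB replication lag (s)", current=0.8, threshold=1.0,
                higher_is_better=False),
    MetricCheck("Payment success rate (%)", current=99.7, threshold=99.8,
                higher_is_better=True),
]
print(metrics_needing_attention(checks))  # ['Payment success rate (%)']
```

With the template's example values, the payment success rate is flagged because 99.7% is below the 99.8% target, which is exactly the kind of detail that is easy to miss when skimming a handoff note.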

Automating the Handoff

We automated 80% of this with a bot:

def generate_handoff_report(outgoing_shift_start, outgoing_shift_end):
    """Collect everything the incoming engineer needs into one report.

    The get_* helpers and format_handoff are integrations with our
    incident, alerting, deploy, and ticketing tools (not shown here).
    """
    report = {
        'incidents': get_incidents(outgoing_shift_start, outgoing_shift_end),
        'active_alerts': get_active_alerts(),
        'recent_deploys': get_deploys(hours=48),
        'upcoming_maintenance': get_maintenance_windows(days=7),
        'slo_status': get_slo_status(),
        'open_tickets': get_oncall_tickets(status='open')
    }

    # Auto-generate a short summary; empty sections are skipped
    summary = []
    if report['incidents']:
        summary.append(f"{len(report['incidents'])} incidents during shift")
    if report['active_alerts']:
        summary.append(f"{len(report['active_alerts'])} active alerts to monitor")
    # Flag any service whose remaining error budget is below 30
    # (units as returned by get_slo_status, presumably percent)
    if any(slo['budget_remaining'] < 30 for slo in report['slo_status']):
        summary.append("WARNING: SLO budget low for some services")

    return format_handoff(report, summary)
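The post does not show `format_handoff`, so here is one possible shape for it, offered as a sketch under assumptions rather than the actual implementation. It assumes each report section is a list of printable items and renders them as Markdown; the section titles and layout are illustrative choices.

```python
def format_handoff(report, summary):
    """Render a handoff report as Markdown text (illustrative sketch)."""
    lines = ["# On-Call Handoff", "", "## Summary"]
    if summary:
        lines.extend(f"- {item}" for item in summary)
    else:
        lines.append("- Nothing notable this shift")

    # One section per report key, e.g. 'active_alerts' -> 'Active Alerts'
    for key, items in report.items():
        title = key.replace("_", " ").title()
        lines.extend(["", f"## {title} ({len(items)})"])
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


# Hypothetical sample data, just to show the output shape.
sample_report = {
    "incidents": ["[P2] Payment timeout spike"],
    "active_alerts": [],
}
print(format_handoff(sample_report, ["1 incidents during shift"]))
```

Showing the item count in each section heading makes an empty section visibly empty instead of silently missing, which helps the incoming engineer trust that nothing was skipped.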

The 15-Minute Handoff Call

The bot generates the report. The humans spend 15 minutes on video:

0-5 min: Outgoing reviews active issues and incidents
5-10 min: Walk through upcoming risks and context
10-15 min: Incoming asks questions, confirms understanding

Critical rule: The outgoing engineer is NOT released until the incoming engineer says "I'm good."
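The release rule can also be enforced in tooling instead of relying on memory. The sketch below is hypothetical (the `Handoff` class and its method names are not from our bot): ownership stays with the outgoing engineer until the incoming engineer explicitly acknowledges.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    """Tracks who owns the pager during a shift change (hypothetical)."""
    outgoing: str
    incoming: str
    acknowledged: bool = False

    def acknowledge(self, engineer: str) -> None:
        # Only the incoming engineer can say "I'm good."
        if engineer != self.incoming:
            raise PermissionError(f"{engineer} is not the incoming on-call")
        self.acknowledged = True

    def current_owner(self) -> str:
        # The outgoing engineer is not released until acknowledgment.
        return self.incoming if self.acknowledged else self.outgoing


handoff = Handoff(outgoing="engineer_a", incoming="engineer_b")
print(handoff.current_owner())  # engineer_a
handoff.acknowledge("engineer_b")
print(handoff.current_owner())  # engineer_b
```

Requiring the acknowledgment to come from the incoming engineer means the outgoing engineer cannot release themselves, which mirrors the rule as stated.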

The Handoff Score

We rate every handoff:

handoff_score:
  report_completed: +1
  call_happened: +1
  all_incidents_documented: +1
  active_issues_listed: +1
  upcoming_risks_noted: +1
  metrics_baseline_included: +1

  max_score: 6
  target: ">= 5"

We track this weekly. Teams that consistently score 5 or higher have 60% fewer "lost context" incidents.
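The rubric is simple enough to compute from a checklist. The sketch below assumes each criterion is recorded as a boolean and that a missing criterion counts as not met; the function names `handoff_score` and `meets_target` are illustrative.

```python
# Criteria and target taken from the scoring rubric above.
CRITERIA = (
    "report_completed",
    "call_happened",
    "all_incidents_documented",
    "active_issues_listed",
    "upcoming_risks_noted",
    "metrics_baseline_included",
)
TARGET = 5


def handoff_score(checklist):
    """Count satisfied criteria; missing keys count as not met."""
    return sum(1 for name in CRITERIA if checklist.get(name, False))


def meets_target(checklist):
    """True when the handoff reaches the target score of 5 out of 6."""
    return handoff_score(checklist) >= TARGET


# Example: everything done except the metrics baseline.
checklist = {name: True for name in CRITERIA}
checklist["metrics_baseline_included"] = False
print(handoff_score(checklist), meets_target(checklist))  # 5 True
```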

Results

| Metric | Before | After |
|--------|--------|-------|
| Monday morning incidents due to lost context | 3-4/month | 0-1/month |
| Time to rebuild context | 2 hours | 15 minutes |
| Incoming on-call confidence (1-5) | 2.3 | 4.6 |
| Escalations due to missing info | 8/month | 1/month |

The best part: engineers actually look forward to handoffs now because they're quick and useful instead of stressful.

If you want AI-generated on-call handoff reports that capture everything automatically, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com