URL: https://dev.to/samson_tanimawo/what-is-multi-agent-sre-a-practical-introduction-5ccj

What Is Multi-Agent SRE? A Practical Introduction


Every SRE team I've talked to this year is running the same experiment in different corners: "Can we have AI do some of this?" The honest answer is yes, but only if you stop thinking of it as a single AI and start thinking of it as a team of agents. Here's what "multi-agent SRE" actually means in practice.

Why one big model isn't enough

The first instinct is to throw a large language model at an incident. Paste the alert, paste the logs, ask for a root cause. It works in demos. It falls apart in production for three reasons.

First, context limits. A real incident spans services, deploy timelines, runbooks, and recent config changes. You run out of tokens before you run out of relevant data.

Second, specialization. Detection is a different job from triage. Triage is a different job from remediation. One prompt trying to do all three produces shallow results everywhere.

Third, trust. A single opaque model that "decides" everything is scary. You can't audit it. You can't pause it. You can't hand parts of its job to a human and keep the rest running.

The multi-agent approach

A multi-agent system decomposes the incident lifecycle into specialists that coordinate.

Detection agent. Watches raw signals. Classifies them into candidate incidents.

Correlation agent. Collapses related alerts into a single incident record. Removes duplicates. Flags downstream noise.

Investigation agent. Walks logs, traces, deploy history, and the service graph. Proposes a root cause with evidence.

Remediation agent. Translates root cause into a specific, reversible action. Waits for human approval before executing anything.

Post-mortem agent. After resolution, drafts the timeline, the contributing factors, and the action items. Humans edit it and ship it.

Each agent owns a narrow job. Each one emits structured output the next agent can consume. The handoff between them is not a chat transcript — it's a typed artifact.
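To make "typed artifact" concrete, here is a minimal sketch of a handoff between the correlation and investigation agents. The class names, fields, and the `investigate` function are all hypothetical, invented for illustration; a real schema would depend on your alerting and tracing stack.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class Incident:
    """Hypothetical output of the correlation agent."""
    incident_id: str
    service: str
    alert_ids: List[str]


@dataclass(frozen=True)
class RootCauseHypothesis:
    """Hypothetical output of the investigation agent."""
    incident_id: str
    summary: str
    evidence: List[str]
    confidence: float  # 0.0 to 1.0, assumed scale


def investigate(incident: Incident) -> RootCauseHypothesis:
    # Stand-in for real investigation logic (logs, traces, deploy history).
    # The point is the signature: a typed Incident in, a typed hypothesis out.
    return RootCauseHypothesis(
        incident_id=incident.incident_id,
        summary=f"Suspected bad deploy on {incident.service}",
        evidence=[f"alert:{alert_id}" for alert_id in incident.alert_ids],
        confidence=0.6,
    )


incident = Incident("inc-1", "checkout", ["a1", "a2"])
hypothesis = investigate(incident)

# The artifact serializes cleanly, so it can be stored, diffed, or reviewed.
print(json.dumps(asdict(hypothesis), indent=2))
```

Because the artifact is a plain record rather than a conversation, the next agent (or a human) reads fields, not prose.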

What this gives you

Three things a single-model system can't give you.

Bounded context. Each agent only carries what it needs. The detection agent never sees the runbook. The post-mortem agent never sees raw logs. Context stays small, which keeps quality high.

Inspectable seams. You can read the output of any one agent and know exactly what it decided. If the investigation agent was wrong, you see why, without untangling a 10-page prompt.

Human takeover at any point. A person can step in between any two agents and continue from the artifact. No re-explaining, no lost history.
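One way to picture human takeover is a pipeline that can be resumed from any stage, given the artifact for that stage. The sketch below is an assumption about how that could look, not a description of any particular product; the stage names, dictionary keys, and `run_from` helper are made up for the example.

```python
from typing import Callable, Dict, List, Tuple

Artifact = Dict[str, object]
Stage = Tuple[str, Callable[[Artifact], Artifact]]


def investigate(artifact: Artifact) -> Artifact:
    # Placeholder investigation step: attaches a root-cause guess.
    return {**artifact, "root_cause": f"suspected bad deploy on {artifact['service']}"}


def plan_remediation(artifact: Artifact) -> Artifact:
    # Placeholder planning step: proposes an action but never executes it.
    return {
        **artifact,
        "proposed_action": f"open approval request for: {artifact['root_cause']}",
        "approved": False,
    }


PIPELINE: List[Stage] = [
    ("investigate", investigate),
    ("plan_remediation", plan_remediation),
]


def run_from(stage_name: str, artifact: Artifact) -> Artifact:
    """Run the pipeline starting at stage_name, using the given artifact as input."""
    names = [name for name, _ in PIPELINE]
    start = names.index(stage_name)  # raises ValueError for an unknown stage
    for _, stage in PIPELINE[start:]:
        artifact = stage(artifact)
    return artifact


# A human disagrees with the investigation agent, edits the artifact,
# and resumes from the next stage without re-running investigation.
edited = {
    "incident_id": "inc-1",
    "service": "checkout",
    "root_cause": "expired TLS certificate on checkout",
}
result = run_from("plan_remediation", edited)
print(result["proposed_action"])
```

The human's correction survives because later stages only read the artifact they are handed.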

What breaks if you get it wrong

Two failure modes dominate early multi-agent builds.

Chatty agents. When agents communicate through a shared memory scratchpad instead of typed artifacts, context drifts. Loops form. The remediation agent reads something the investigation agent wrote three turns ago and acts on stale information.

Unscoped permissions. When every agent has the same credentials, one compromised prompt can trigger actions far outside its responsibility. Scope down. Hard.
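Scoping down can start as something as simple as a per-agent allow-list checked before every tool call. This is a rough sketch under that assumption; the agent names mirror the ones above, but the action names and the `authorize` helper are hypothetical, and a production setup would lean on real credentials (IAM roles, service accounts) rather than an in-process table.

```python
from typing import Dict, FrozenSet

# Hypothetical allow-list: each agent gets only the actions its job needs.
AGENT_SCOPES: Dict[str, FrozenSet[str]] = {
    "detection": frozenset({"read_metrics"}),
    "correlation": frozenset({"read_alerts"}),
    "investigation": frozenset({"read_logs", "read_traces", "read_deploys"}),
    "remediation": frozenset({"propose_action"}),
}


class ScopeError(PermissionError):
    """Raised when an agent attempts an action outside its scope."""


def authorize(agent: str, action: str) -> None:
    # Unknown agents get an empty scope, so they are denied by default.
    allowed = AGENT_SCOPES.get(agent, frozenset())
    if action not in allowed:
        raise ScopeError(f"{agent} is not allowed to {action}")


authorize("investigation", "read_logs")  # passes silently

try:
    authorize("correlation", "restart_service")
except ScopeError as err:
    print(f"denied: {err}")
```

Note that even the remediation agent in this sketch can only propose; execution sits behind a separate, human-approved path.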

Where to start

If you want to experiment, start with the narrowest agent: correlation. It's read-only, the input is well-defined (alerts), and the output is obvious (grouped incidents). You can ship it, measure its precision, and roll it back with no blast radius.
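A first correlation agent does not need a model at all. The sketch below groups alerts by service and time proximity; the `Alert` fields, the `correlate` function, and the five-minute window are assumptions chosen for illustration, and real correlation would likely also use the service graph.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alert:
    alert_id: str
    service: str
    timestamp: float  # seconds since epoch


def correlate(alerts: List[Alert], window_s: float = 300.0) -> List[List[Alert]]:
    """Group alerts on the same service that arrive within window_s
    of the previous alert in that service's most recent group."""
    groups: List[List[Alert]] = []
    latest_group: Dict[str, List[Alert]] = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        group = latest_group.get(alert.service)
        if group is not None and alert.timestamp - group[-1].timestamp <= window_s:
            group.append(alert)
        else:
            # Too far from the last alert (or first for this service): new incident.
            group = [alert]
            groups.append(group)
            latest_group[alert.service] = group
    return groups


alerts = [
    Alert("a1", "checkout", 0.0),
    Alert("a2", "checkout", 120.0),
    Alert("a3", "search", 130.0),
    Alert("a4", "checkout", 900.0),
]
groups = correlate(alerts)
print([[a.alert_id for a in g] for g in groups])
# prints [['a1', 'a2'], ['a3'], ['a4']]
```

The function only reads its input and returns new groupings, which is what makes it safe to ship, measure, and roll back.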

Once correlation is stable, add investigation. Then detection. Remediation comes last, and only with approval queues and reversible actions.
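For the remediation step, "approval queues and reversible actions" could look roughly like this. It is a sketch, not a reference design: `ProposedAction`, `ApprovalQueue`, and the in-memory `state` dictionary are hypothetical stand-ins for whatever change-management and infrastructure APIs you actually use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProposedAction:
    description: str
    apply: Callable[[], None]   # performs the change
    revert: Callable[[], None]  # undoes the change
    approved: bool = False


class ApprovalQueue:
    def __init__(self) -> None:
        self._pending: List[ProposedAction] = []

    def submit(self, action: ProposedAction) -> None:
        self._pending.append(action)

    def approve(self, index: int) -> None:
        # In practice this would be triggered by a human, not by an agent.
        self._pending[index].approved = True

    def execute_approved(self) -> List[ProposedAction]:
        """Apply only approved actions and return them so they can be reverted."""
        executed = [a for a in self._pending if a.approved]
        for action in executed:
            action.apply()
        self._pending = [a for a in self._pending if not a.approved]
        return executed


# Toy "infrastructure" so the example is self-contained.
state = {"replicas": 3}


def scale_up() -> None:
    state["replicas"] = 5


def scale_down() -> None:
    state["replicas"] = 3


queue = ApprovalQueue()
queue.submit(ProposedAction("scale checkout to 5 replicas", scale_up, scale_down))

queue.execute_approved()  # nothing approved yet, so nothing runs
replicas_before_approval = state["replicas"]

queue.approve(0)
done = queue.execute_approved()
replicas_after_apply = state["replicas"]

done[0].revert()  # every action carries its own undo
print(replicas_before_approval, replicas_after_apply, state["replicas"])
```

Pairing each action with its `revert` at proposal time means the rollback path is reviewed along with the change itself.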

That path takes months, not days. That's fine. The teams doing this well are not racing; they're building something they can trust at 3am.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com