VOOZH about

URL: https://dev.to/tacoda/post-mortems-for-agent-runs-32bd

⇱ Post-Mortems for Agent Runs - DEV Community


The agent burned five hours on a refactor that should have taken one. The first hour was fine. The second, it was rewriting a module nobody had asked it to touch. The third through fifth went to rolling back, re-planning, and producing the smaller diff we should have ended up with at the start. The work landed late. The team was annoyed. The lesson was sitting there waiting to be picked up.

We did not pick it up. The work shipped, the agent moved on, and the same failure mode showed up two weeks later in a different module. We paid for the first incident twice because we had treated it as an annoyance instead of a learning moment.

The fix was obvious in retrospect. Agent failures want post-mortems, the same way human incidents do. The practice does not transfer automatically, and most teams have not built the habit.

What the post-mortem is actually for

The point of a post-mortem is not to assign blame. With agents, that is even more true. There is no person to blame, and pretending the agent is a person leads to bad analysis. The agent did what the agent does. The question is where the team’s setup failed to catch it.

The setup is a stack of layers: the task description, the harness rules, the hooks, the tests, the code review, and the human supervising. A failure is a place where one of those layers should have caught the problem and did not.

The post-mortem’s job is to find which layer missed and patch it. The deliverable is a small set of changes — a rule added or modified, a hook tightened, a feature doc written, a task-writing practice updated. Not a list of “things to remember.” A list of changes to the system.

When to run one

Not every failure deserves a full post-mortem. The cost is real and the bar should be too.

I run one when the failure cost at least an hour of redo, or when the same failure type has shown up twice in two weeks. The first condition catches the expensive single incidents. The second catches the cheap repeats that compound into expensive patterns.

The other trigger: a near-miss that would have been a real incident if a particular reviewer had not caught it. Near-misses are the most underused signal in agent work. The harness was almost not enough. The reviewer was the last line of defense. The next failure of the same kind might not have that reviewer in the room.

A near-miss is a free post-mortem. The cost has not been paid yet. The lesson is available for the price of the analysis.

Five sections, one page

The shape that works for me is short and rigid.

What happened. Three sentences, no commentary. The agent was asked to add a validation step to the user-import flow. It produced a 600-line PR that rewrote the entire flow. The PR was caught in review and rolled back.

What the agent was working from. The task description. The harness rules that fired. The context it had loaded. If the failure came from a missing rule, the absence is the finding. If it came from a misread task, the description is the finding.

Where the layers failed. Walk down the stack. For each layer, ask: was this failure mode in scope, and if so, why did the layer not catch it? This is the load-bearing section. The answers are the changes.

What changes. One or two specific edits. A rule added. A hook tightened. A feature doc scoped to a path. A change to how tickets are written for that kind of work. Small enough to land the same week as the post-mortem.

What we are not doing. This is the part teams skip and then regret. A post-mortem produces a temptation to add three rules, two hooks, and a process. Most of them will be wrong. Name the ones you considered and rejected, with the reason. The next post-mortem on a similar failure can revisit them, with evidence.

The whole document is one page. If it grows past a page, the failure is being over-analyzed and the changes are being padded.

Where most teams miss the lesson

The pattern I see in teams that adopt agents but do not improve over time: failures get framed as the agent’s failures. The response is “the agent is not ready for that,” and the rope gets pulled back without a corresponding investment in the harness.

The next failure of the same shape happens to the next contributor, who has no record of the first one. They make the same trust adjustment, alone. The team’s collective knowledge of where the agent is reliable stays the same. The harness does not improve because nobody wrote down what should change.

The fix is the post-mortem and the patching. The failures are the agent’s, in a literal sense. They are also the team’s, in the sense that the team’s setup let the failure through. The “team’s failure” framing is what produces the patching. The “agent’s failure” framing produces resignation.

The patterns that keep showing up

A few categories of finding come up over and over, across the teams I have talked to.

The agent acted on an ambiguous request. The task description had two readings and the agent picked one silently. The patch is on the task-writing side, on the harness side (a rule that flags ambiguity), or both.

The agent worked in an unfamiliar area without orienting. The diff touched a module the agent had not read. The conventions of that module did not survive the touch. The patch is a feature doc for the module, or a rule about orienting before editing.

The agent followed an aspirational rule too literally. The harness said prefer composition and the agent rewrote a working inheritance hierarchy. The patch is to scope the rule or soften it. State it as a preference for new code, not a rewrite trigger.

The agent escalated when it should have proceeded, or proceeded when it should have escalated. The escalation ladder is mistuned. The patch is on the ladder itself: a new rung, a tightened rung, a removed rung.

The agent’s mental model of the code drifted from reality. A refactor renamed something. The harness still references the old name. The agent reads the harness and confidently uses the missing name. The patch is to update the harness when the rename happens. The harness is a document. Documents need maintenance.

None of these are exotic. They are failures the team already half-knows about. The post-mortem makes the half-knowledge explicit and ships a change.

Once a week, once a quarter

I write the post-mortem within a week of the failure. Later than that and the details fade; the analysis becomes a story rather than a forensic exercise. The changes I produce, I land the same week.

I read the last quarter of post-mortems once a quarter, looking for cross-cutting patterns. Sometimes three of them all point to the same gap and the right patch is a single change that subsumes the three smaller patches. The quarterly review is where that consolidation happens.

The cost is small. An hour for the post-mortem itself. An hour or two for the patches. Half a day per quarter for the review. The compounding return is most of why our harness is the shape it is today.

Run your first post-mortem this week

The first post-mortem is the hardest. There is no template, the team is not sure what they are looking for, and the temptation is to make it about the agent’s competence instead of the team’s setup. Lean against that temptation. The practice only pays off if it produces changes to the setup.

The next one writes itself a little easier. The shape emerges by the third or fourth document. The structure above is what mine settled into after about six months; yours will look slightly different and that is fine.

Pick one signal to start. If the agent burned more than an hour going the wrong direction is too specific, try any time someone says the agent made the work harder this week. Looser triggers, narrower analysis. The point is to make the failure visible and the patch concrete.

Three things to do this week:

  1. Pick the most recent failure where somebody said that took way too long. Write one page. What happened, what the agent had loaded, which layer should have caught it, what one change you are making.
  2. Land the change before Friday. A rule. A hook. A line in the task template. Whatever the post-mortem named.
  3. Set a calendar reminder for ninety days out: read the post-mortems in order and look for a pattern.

The agent’s failures are not noise. They are the most accurate map you have of where your setup is weak. Stop letting them go unanalyzed.

The next failure will tell you something. The post-mortem is how you hear it.