URL: https://dev.to/saurav_bhattacharya/your-agent-passed-every-eval-and-still-cost-4000-a-day-3ndl

Your Agent Passed Every Eval and Still Cost $4,000 a Day


Here is a failure mode nobody puts on their roadmap: the agent works. It answers correctly. It passes your golden set. Your model-as-judge gives it a 9.2. Your hallucination checks are green. And then finance forwards you the inference bill and asks, politely, what exactly you have been doing for the last three weeks.

Most eval suites measure one axis: was the output correct? That is the axis everyone copies from the demos and the leaderboards. But "correct" is necessary, not sufficient. A production agent has at least three more dimensions that determine whether it survives contact with reality — cost, latency, and tool-call efficiency — and almost nobody scores them. So they regress silently, release after release, until they become an incident instead of a number on a dashboard.

I want to make the case that operational metrics are evals, not monitoring afterthoughts, and show how to wire them in.

"Correct" is the cheapest thing to measure and the least complete

Think about what actually changes between two versions of an agent. You tweak a system prompt. You add a "think step by step" nudge. You bump the model. You add a retrieval step "just to be safe." Every one of those changes can leave correctness flat while quietly doubling token usage or adding three seconds of latency or sending the agent into a 14-step tool-calling spiral to answer a question that used to take two.

Your correctness eval will not catch any of that. It is, by design, blind to how the answer was produced. It only looks at the final string. Which means the most expensive regressions in agentic systems are exactly the ones a naive eval suite is structurally incapable of seeing.

The fix is not more correctness tests. It is treating the trace — the full record of how the agent reached its answer — as a first-class eval target.

The two halves: scoring the output vs. capturing the path

This is where two tools that ship as a unit earn their keep, and you need both because they measure different things.

agent-eval scores and gates the agent's output. It runs your assertions — deterministic checks, model-as-judge, drift, hallucination — and, critically, it can assert on operational metrics too. It is the thing that fails the build when a number crosses a line.

AgentLens captures the trace: every model call and tool step, the resolved inputs that actually went over the wire, the raw outputs that came back, token counts, and wall-clock timings per step. Without that trace, an operational eval signal is undebuggable — you'd know cost went up 40% but have zero idea which step did it. agent-eval tells you the bill doubled; AgentLens tells you it was the retrieval step firing three times because of a bad cache key.

You score the output. You capture the path. The path is what makes the score actionable instead of just alarming. One without the other is half a workflow.
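To make "which step did it" concrete, here is a minimal sketch of attributing token spend to steps in a captured trace. The `TraceStep` shape and the `tokensByStep` helper are assumptions for illustration, not the actual AgentLens schema or API; the point is only that once steps carry token counts, the attribution is a group-by.

```typescript
// Hypothetical trace step shape; the real AgentLens schema may differ.
interface TraceStep {
  name: string;           // e.g. "retrieve_order"
  type: "model" | "tool";
  tokens: number;         // tokens billed for this step (0 for pure tool steps)
  durationMs: number;
}

interface StepCost {
  name: string;
  calls: number;
  tokens: number;
  share: number;          // fraction of the run's total tokens, 0..1
}

// Group steps by name and rank them by token spend, largest first.
function tokensByStep(steps: TraceStep[]): StepCost[] {
  const total = steps.reduce((n, s) => n + s.tokens, 0);
  const byName: Record<string, { calls: number; tokens: number }> = {};
  for (const s of steps) {
    const entry = byName[s.name] ?? { calls: 0, tokens: 0 };
    entry.calls += 1;
    entry.tokens += s.tokens;
    byName[s.name] = entry;
  }
  return Object.keys(byName)
    .map((name) => ({
      name,
      calls: byName[name].calls,
      tokens: byName[name].tokens,
      share: total === 0 ? 0 : byName[name].tokens / total,
    }))
    .sort((a, b) => b.tokens - a.tokens);
}

// Illustrative run: the retrieval step fires three times.
const exampleSteps: TraceStep[] = [
  { name: "plan", type: "model", tokens: 1200, durationMs: 900 },
  { name: "retrieve_order", type: "tool", tokens: 1500, durationMs: 400 },
  { name: "retrieve_order", type: "tool", tokens: 1500, durationMs: 410 },
  { name: "retrieve_order", type: "tool", tokens: 1500, durationMs: 395 },
  { name: "answer", type: "model", tokens: 800, durationMs: 700 },
];

// retrieve_order ranks first: 3 calls, 4500 of the run's 6500 tokens.
console.log(tokensByStep(exampleSteps));
```

A ranked list like this is what turns "cost went up 40%" into "the retrieval step is running three times".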

What an operational eval actually looks like

Let's make it concrete. AgentLens gives you a structured trace per run; agent-eval lets you assert against it. Here's the shape:

import { defineEval, runEval } from "agent-eval";
import { getTrace } from "agentlens";

// A "case" is one task we want the agent to handle.
// The binding is named refundEval because `eval` is not a legal
// variable name in an ES module (modules are always strict mode).
const refundEval = defineEval({
  name: "refund-lookup",
  cases: [
    { input: "Where is my refund for order 88231?", expectResolved: true },
  ],

  scorers: [
    // Correctness: necessary, but not the whole story.
    async ({ output, expected }) => ({
      name: "resolves_refund",
      // Pass when the answer mentions the refund exactly when the case expects it.
      pass: output.includes("refund") === expected.expectResolved,
    }),

    // Operational scorers, pulled straight from the captured trace.
    async ({ runId }) => {
      const trace = await getTrace(runId);

      // Steps without a token count (e.g. pure tool calls) count as 0.
      const totalTokens = trace.steps.reduce((n, s) => n + (s.tokens ?? 0), 0);
      const toolCalls = trace.steps.filter((s) => s.type === "tool").length;
      // Wall-clock latency of this single run, not a percentile.
      const latencyMs = trace.endedAt - trace.startedAt;

      return [
        { name: "token_budget", pass: totalTokens <= 6000, value: totalTokens },
        { name: "tool_calls", pass: toolCalls <= 4, value: toolCalls },
        { name: "latency_ms", pass: latencyMs <= 8000, value: latencyMs },
      ];
    },
  ],
});

const report = await runEval(refundEval);
if (!report.passed) {
  // Same gate as a failing correctness test. No special-casing.
  process.exit(1);
}

Three things make this work, and they are all opinions I will defend.

1. The budgets live next to the correctness assertions. Not in a Grafana panel someone glances at on Fridays. In the same file, enforced by the same process.exit(1). A 5x token regression should fail the build with the same authority as a wrong answer, because operationally it is just as much a defect.

2. The numbers come from the trace, not from re-instrumentation. You are not sprinkling console.time through your agent. AgentLens already recorded every step's tokens and timing as a side effect of running. agent-eval just reads it back. If your operational metrics require a separate instrumentation pass, you will skip it, and you know you will.

3. You assert on value, not just pass. Store the number. Because the day a budget fails, your very next question is "by how much, and since when?" — and that is a trend, not a boolean.
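Here is a rough sketch of why the stored number matters: with a history of values you can answer "by how much, and since when?" directly. The `MetricPoint` shape and `describeBreach` helper are hypothetical and say nothing about how agent-eval actually persists results; assume one stored value per commit, oldest first.

```typescript
// Hypothetical record: one persisted scorer value per commit, oldest first.
interface MetricPoint {
  commit: string;
  value: number;
}

interface Breach {
  since: string;      // first commit of the current over-budget streak
  overBy: number;     // how far the latest value is over budget
  overByPct: number;  // the same, as a rounded percentage of the budget
}

// Returns null while the latest value is still within budget.
function describeBreach(points: MetricPoint[], budget: number): Breach | null {
  if (points.length === 0) return null;
  const latest = points[points.length - 1];
  if (latest.value <= budget) return null;

  // Walk back to the start of the current over-budget streak.
  let i = points.length - 1;
  while (i > 0 && points[i - 1].value > budget) i--;

  const overBy = latest.value - budget;
  return {
    since: points[i].commit,
    overBy,
    overByPct: Math.round((overBy / budget) * 100),
  };
}

// Illustrative values against the 6000-token budget used earlier.
const tokenHistory: MetricPoint[] = [
  { commit: "a1", value: 5200 },
  { commit: "b2", value: 5800 },
  { commit: "c3", value: 6300 },
  { commit: "d4", value: 6900 },
];

// Over budget since commit "c3", currently by 900 tokens (15%).
console.log(describeBreach(tokenHistory, 6000));
```

With only a boolean per run, the same question would mean re-running old commits to find out.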

Catching the slow bleed

The dangerous regressions are rarely a cliff. They are a slow bleed: 4,100 tokens, then 4,400, then 4,900, each one under budget, until one day it isn't and you have no idea which of the last forty PRs did it.

Because agent-eval persists the trace-derived values per run, you diff them:

import { compareRuns } from "agent-eval";

const diff = await compareRuns({ base: "main", head: "PR-512" });

for (const m of diff.metrics) {
  // Warn when the head value is more than 15% above the base value.
  if (m.delta > m.baseValue * 0.15) {
    console.warn(
      `⚠ ${m.case}/${m.name}: ${m.baseValue} → ${m.headValue} ` +
        `(+${Math.round((m.delta / m.baseValue) * 100)}%), see trace ${m.headRunId}`
    );
  }
}

A 15% jump in tokens on a PR is a conversation, even if every absolute budget still passes. And because the warning carries the AgentLens headRunId, the reviewer is one click from the exact step that moved. The eval says that it regressed; the trace says why. You don't argue about it in the PR thread — you open the trace.
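One caveat with a per-PR threshold: the slow bleed described earlier (4,100, then 4,400, then 4,900 tokens) never moves 15% in a single step, yet it is up almost 20% overall. A minimal sketch of checking cumulative drift against a fixed starting point, assuming you can read back the stored values in order; `detectSlowBleed` is a hypothetical helper, not part of agent-eval.

```typescript
// Relative change between two positive values, e.g. 0.15 for +15%.
function relativeChange(base: number, head: number): number {
  return (head - base) / base;
}

interface DriftReport {
  maxStepChange: number;     // largest increase between consecutive values
  cumulativeChange: number;  // change from the first value to the last
  slowBleed: boolean;        // no single step tripped, but the total did
}

// `values` are one metric's stored values, oldest first, all assumed positive.
function detectSlowBleed(values: number[], threshold = 0.15): DriftReport {
  if (values.length < 2) {
    return { maxStepChange: 0, cumulativeChange: 0, slowBleed: false };
  }
  let maxStepChange = 0; // starts at 0, so decreases are ignored
  for (let i = 1; i < values.length; i++) {
    maxStepChange = Math.max(maxStepChange, relativeChange(values[i - 1], values[i]));
  }
  const cumulativeChange = relativeChange(values[0], values[values.length - 1]);
  return {
    maxStepChange,
    cumulativeChange,
    slowBleed: maxStepChange <= threshold && cumulativeChange > threshold,
  };
}

// Steps of about 7% and 11% each pass a 15% check; the total is about 19.5%.
const drift = detectSlowBleed([4100, 4400, 4900]);
console.log(drift.slowBleed); // true
```

Comparing against a pinned baseline (a release tag, say) as well as the immediate parent is one way to keep forty small PRs from adding up unnoticed.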

The uncomfortable part

Adopting this means admitting your agent has a cost and latency profile that is a product surface, not an implementation detail. The "just add another tool call to be safe" reflex is exactly how a $400/day agent becomes a $4,000/day agent, one defensible little change at a time. None of those changes is wrong in isolation. Their sum is the incident.
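The arithmetic behind a jump like that is simple enough to keep next to your budgets. This is a back-of-the-envelope sketch: the run volume and the price per million tokens are made-up illustrative inputs, not measurements or any provider's real pricing.

```typescript
// Daily spend from per-run token usage. All inputs are illustrative assumptions.
function dailyCostUsd(
  tokensPerRun: number,
  runsPerDay: number,
  usdPerMillionTokens: number
): number {
  return ((tokensPerRun * runsPerDay) / 1000000) * usdPerMillionTokens;
}

// The inverse: the per-run token budget implied by a daily spend target.
function tokensPerRunForBudget(
  dailyBudgetUsd: number,
  runsPerDay: number,
  usdPerMillionTokens: number
): number {
  return ((dailyBudgetUsd / usdPerMillionTokens) * 1000000) / runsPerDay;
}

// Hypothetical: 20,000 runs a day at $5 per million tokens.
console.log(dailyCostUsd(4000, 20000, 5));         // 400
console.log(dailyCostUsd(40000, 20000, 5));        // 4000
console.log(tokensPerRunForBudget(400, 20000, 5)); // 4000
```

Working backwards from the spend you can defend to a per-run token budget gives the threshold in your eval a reason to exist, instead of a number someone picked once.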

So put a number on it. Score the output with agent-eval, capture the path with AgentLens, and let the two together fail your build when the agent gets correct but expensive. Correctness keeps you honest with your users. Operational evals keep you honest with everyone who has to run the thing in production — which, eventually, is you.

The agent that passes every correctness eval and still bankrupts the feature is not a hypothetical. It is the default outcome of measuring only the half of the system that is easy to measure.