When an AI model solves a complex task autonomously — browsing the web, writing code, running tests, fixing errors, and iterating until the output passes review — it is easy to credit the model. The model reasoned well. The model wrote good code. The model figured it out.
But almost always, a second system made that possible. It decided what context to give the model. It routed the model's output to the right tool. It checked whether the result was acceptable. It handled the errors. It ran the loop again when the first attempt failed.
That second system is the agent harness.
The harness is why the same model can fail at a task when called once and succeed at the same task when wrapped in the right scaffolding. It is also why, when researchers report benchmark gains without changing the model, they almost always changed the harness.
An agent harness is the orchestration layer that sits between your AI model and the environment it needs to act in. It manages the full execution lifecycle of an agentic task:
Without the harness, you have step 3 and step 4 only — a single prompt and a single response. The harness is what turns a language model into an agent.
In 2026, the most striking evidence for harness importance comes from benchmarks. LangChain's Deep Agents team reported significant gains on Terminal-Bench 2.0 using the same underlying model; only the harness changed. The scaffolding around the model (how context was assembled, how tool outputs were formatted, how retries were managed) produced the improvement, with no model upgrade involved.
This is not an isolated finding. It is the pattern:
Better harness on the same model > same harness on a better model — in many real-world tasks.
The reason is structural. The model only sees what the harness gives it. If the harness gives the model noisy context, the model produces noisy output. If the harness truncates relevant information to fit a context window, the model reasons from an incomplete picture. If the harness has no verification step, the model has no signal that it was wrong. The model cannot compensate for harness failures with capability alone.
Task definition is the entry point. The harness receives a goal (sometimes called an objective, spec, or task) and converts it into the first prompt the model sees. Good task definitions:
The task definition layer is where loop engineering starts — you define the exit condition before the loop begins.
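As a minimal sketch of defining the exit condition before the loop begins, a task definition can be modelled as a small record that carries the goal together with its done-check. The `TaskSpec` class, its field names, and the example goal are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TaskSpec:
    """Hypothetical task definition: a goal plus an explicit exit condition."""
    goal: str
    # Exit condition, fixed before the loop starts: returns True when the task is done.
    is_done: Callable[[str], bool]
    constraints: list[str] = field(default_factory=list)
    max_iterations: int = 5

    def first_prompt(self) -> str:
        """Render the goal and constraints into the first prompt the model sees."""
        lines = [f"Goal: {self.goal}"]
        if self.constraints:
            lines.append("Constraints:")
            lines.extend(f"- {c}" for c in self.constraints)
        return "\n".join(lines)


# Illustrative values only; the done-check here inspects a test runner's summary line.
spec = TaskSpec(
    goal="Make the failing unit test pass",
    is_done=lambda output: "all tests passed" in output.lower(),
    constraints=["Do not edit files under tests/"],
)
```

Keeping `is_done` on the task itself means the loop controller never has to guess what "finished" means for a given goal.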
The model has a context window. The harness decides what fills it.
For short tasks, this is simple: put the task and prior tool outputs in the prompt. For long tasks spanning many tool calls or long documents, the harness must:
Poor context management is the most common cause of harness failure on long tasks. The model loses track of the goal, repeats steps it already completed, or starts contradicting its own prior work.
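One way to keep the goal in view on long tasks is to pin it at the top of every prompt and fill the remaining space with the newest history that fits. The sketch below assumes character counts as a stand-in for tokens; `build_context` and its parameters are hypothetical names, and a real harness would likely use the model's tokenizer and summarise dropped entries rather than only counting them.

```python
def build_context(goal: str, history: list[str], history_budget_chars: int = 2000) -> str:
    """Assemble a prompt that always contains the goal plus the newest history that fits.

    Assumption: character count approximates token count.
    """
    header = f"Goal: {goal}"
    remaining = history_budget_chars
    kept: list[str] = []
    # Walk history newest-first so recent tool outputs win when space is tight.
    for entry in reversed(history):
        cost = len(entry) + 1  # +1 for the joining newline
        if cost > remaining:
            break
        kept.append(entry)
        remaining -= cost
    dropped = len(history) - len(kept)
    parts = [header]
    if dropped:
        # Tell the model earlier steps exist, so it is less likely to repeat them.
        parts.append(f"[earlier entries omitted: {dropped}]")
    parts.extend(reversed(kept))  # restore chronological order
    return "\n".join(parts)
```

Because the goal is added outside the history budget, it can never be truncated away, which targets the "model loses track of the goal" failure described above.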
The harness calls tools on behalf of the model. This includes:
The tool layer is responsible for sandboxing (ensuring tool calls can't cause unintended damage), timeout handling (a hanging subprocess shouldn't freeze the whole harness), and output normalisation (converting raw tool results into a format the model can use).
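The timeout and output-normalisation duties can be sketched with the standard library's `subprocess.run`. The `run_tool` wrapper and its result keys are hypothetical. Note that this sketch does not sandbox anything: passing the command as a list avoids shell interpretation, but real isolation (containers, restricted users, and so on) would have to be added separately.

```python
import subprocess
import sys


def run_tool(argv: list[str], timeout_s: float = 10.0, max_chars: int = 4000) -> dict:
    """Run a command with a timeout and return a normalised result dict."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # A hanging subprocess becomes an ordinary failed result, not a frozen harness.
        return {"ok": False, "exit_code": None, "output": f"timed out after {timeout_s}s"}
    except FileNotFoundError:
        return {"ok": False, "exit_code": None, "output": f"command not found: {argv[0]}"}
    output = (proc.stdout + proc.stderr).strip()
    if len(output) > max_chars:
        # Keep the tail: errors usually appear at the end of tool output.
        output = "[truncated]\n" + output[-max_chars:]
    return {"ok": proc.returncode == 0, "exit_code": proc.returncode, "output": output}


result = run_tool([sys.executable, "-c", "print('hi')"])
```

Every outcome, including timeouts and missing binaries, comes back in the same shape, so the model always receives something it can read and react to.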
The harness decides when to call the model again and when to stop.
Iteration triggers:
Exit conditions:
The loop controller is where the "agent-ness" lives. A model without a loop controller isn't an agent — it's an API call.
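A loop controller can be reduced to a single decision function that checks exit conditions before allowing another model call. The specific conditions below (verification passed, iteration cap, token budget) are assumptions chosen for illustration, as are the names `LoopState` and `next_action`.

```python
from dataclasses import dataclass


@dataclass
class LoopState:
    """Hypothetical snapshot of where the loop stands after an iteration."""
    iteration: int
    verified: bool
    tokens_used: int


def next_action(state: LoopState, max_iterations: int = 5, token_budget: int = 50_000) -> str:
    """Decide whether to call the model again. Exit conditions are checked first."""
    if state.verified:
        return "stop:success"
    if state.iteration >= max_iterations:
        return "stop:max_iterations"
    if state.tokens_used >= token_budget:
        return "stop:budget_exhausted"
    # No exit condition met: a failed verification triggers another iteration.
    return "continue"
```

Returning a reason alongside the stop decision lets the caller distinguish a finished task from an abandoned one.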
Verification is the most important component and the one most often skipped.
The verification layer checks whether the task is actually done. A good verification check is:
Examples of strong verification:
Examples of weak verification:
Loop engineering is essentially the practice of designing good verification layers and connecting them to loop controllers.
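To illustrate how a verification layer can hand a loop controller both a pass/fail signal and usable feedback, here is a small sketch built from named, binary checks. The `Verdict` type, the helper names, and the JSON-shaped example output are hypothetical; real checks would more often be test suites or linters.

```python
import json
from dataclasses import dataclass
from typing import Callable

Check = Callable[[str], bool]


@dataclass
class Verdict:
    passed: bool
    feedback: str  # fed back to the model when verification fails


def is_json_object(text: str) -> bool:
    """True if the text parses as a JSON object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def has_key(key: str) -> Check:
    """Build a check that requires a top-level key in a JSON object."""
    return lambda text: is_json_object(text) and key in json.loads(text)


def verify(output: str, checks: list[tuple[str, Check]]) -> Verdict:
    """Run every named check; pass only if all of them pass."""
    failed = [name for name, check in checks if not check(output)]
    if failed:
        return Verdict(False, "failed checks: " + ", ".join(failed))
    return Verdict(True, "")


checks = [("valid JSON object", is_json_object), ("has 'summary' key", has_key("summary"))]
```

Each check is deterministic and named, so a failure tells the next iteration exactly what to fix instead of merely that something went wrong.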
What happens when the loop can't converge? The harness needs explicit handling for:
Without explicit failure handling, harnesses fail in opaque ways: infinite loops, silent partial results, or crashes that surface as confusing downstream errors.
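One hedged way to make non-convergence explicit is to raise named exceptions instead of looping forever or returning partial work. The exception classes and the `check_progress` helper below are hypothetical, and the "identical consecutive outputs" stall test is just one simple heuristic.

```python
class HarnessFailure(Exception):
    """Base class so callers can catch every explicit harness failure in one place."""


class MaxIterationsExceeded(HarnessFailure):
    """The loop used up its attempts without passing verification."""


class StalledLoop(HarnessFailure):
    """The model keeps producing the same output, so retrying is pointless."""


def check_progress(outputs: list[str], max_iterations: int = 5, stall_window: int = 2) -> None:
    """Raise a named failure when the loop should stop trying; otherwise return None."""
    if len(outputs) >= max_iterations:
        raise MaxIterationsExceeded(f"no passing result after {len(outputs)} attempts")
    recent = outputs[-stall_window:]
    if len(recent) == stall_window and len(set(recent)) == 1:
        raise StalledLoop(f"last {stall_window} attempts were identical")
```

Calling `check_progress` after each failed attempt turns an opaque hang into an error with a name and a message that downstream code can log or escalate.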
The most basic harness: call the model, run the verification, loop if it fails.
    goal → model call → tool execution → verify
           ↑_______________________________|   (if fail, retry)
                                           ↓   (if pass, exit)
This is what Claude Code's /loop command implements. It works well for tasks with fast, cheap verification (test suites, lint checks).
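The retry-until-verified pattern can be sketched in a few lines. This is a generic illustration, not a description of any particular product's implementation: `run_simple_loop`, `make_stub_model`, and the stubbed replies are hypothetical, and tool execution is folded into the `call_model` callable for brevity.

```python
from typing import Callable


def run_simple_loop(
    goal: str,
    call_model: Callable[[str], str],
    verify: Callable[[str], tuple[bool, str]],
    max_iterations: int = 3,
) -> tuple[str, int]:
    """Call the model, verify, and retry with feedback until it passes or attempts run out."""
    feedback = ""
    for attempt in range(1, max_iterations + 1):
        prompt = goal if not feedback else f"{goal}\nPrevious attempt failed: {feedback}"
        output = call_model(prompt)
        passed, feedback = verify(output)
        if passed:
            return output, attempt
    raise RuntimeError(f"gave up after {max_iterations} attempts: {feedback}")


def make_stub_model(replies: list[str]) -> Callable[[str], str]:
    """Stand-in for a real model call: returns canned replies in order."""
    remaining = iter(replies)
    return lambda prompt: next(remaining)


answer, attempts = run_simple_loop(
    "What is 2 + 3?",
    make_stub_model(["4", "5"]),
    lambda out: (out == "5", f"expected 5, got {out}"),
)
```

In this run the first stubbed reply fails verification and the second passes, so the loop exits on attempt 2.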
The model first generates a plan (a list of steps), then the harness executes each step in sequence, calling the model for each one.
goal → model (plan) → [step 1 → model → tool] → [step 2 → model → tool] → verify → exit
Used in agentic coding workflows where the task is complex enough to benefit from explicit decomposition.
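A plan-then-execute harness can be sketched as one planning call followed by one model call and one tool call per step. All function and parameter names here are hypothetical, and the stubs replace real model and tool calls so the control flow is visible on its own.

```python
from typing import Callable


def run_plan_execute(
    goal: str,
    plan_model: Callable[[str], list[str]],
    step_model: Callable[[str, str, list], str],
    run_tool: Callable[[str], object],
    verify: Callable[[list], bool],
) -> tuple[bool, list]:
    """Ask for a plan once, then execute each step in order and verify the results."""
    steps = plan_model(goal)
    results: list = []
    for step in steps:
        # The step model sees the goal, the current step, and prior results.
        action = step_model(goal, step, results)
        results.append(run_tool(action))
    return verify(results), results


# Stubs for illustration: the "plan" is two additions and the "tool" parses a number.
ok, results = run_plan_execute(
    goal="Reach a total of 5",
    plan_model=lambda goal: ["add 2", "add 3"],
    step_model=lambda goal, step, prior: step.split()[1],
    run_tool=lambda action: int(action),
    verify=lambda results: sum(results) == 5,
)
```

Verification here runs once at the end; a stricter variant could verify after each step and replan when a step fails.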
The harness coordinates multiple model calls in parallel or in sequence, each specialised for a subtask. A coordinator model routes work to specialist agents (coder, reviewer, tester, documenter) and aggregates results.
coordinator model
├── coder agent → code
├── reviewer agent → review
└── tester agent → test results
→ aggregate → verify → exit
This pattern is described in the Anthropic managed agents architecture, and some argue that multi-agent collectives built this way are a likely pathway to ASI.
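The fan-out and aggregate steps of a coordinator can be sketched with the standard library's thread pool. In this simplified illustration the routing is a fixed dictionary rather than a coordinator model's decision, and the specialists are stub functions standing in for separate model calls; `run_specialists` and `aggregate` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Agent = Callable[[str], str]


def run_specialists(task: str, agents: dict[str, Agent]) -> dict[str, str]:
    """Fan the task out to every specialist in parallel and collect results by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        futures = {name: pool.submit(agent, task) for name, agent in agents.items()}
        return {name: future.result() for name, future in futures.items()}


def aggregate(results: dict[str, str]) -> str:
    """Merge specialist outputs into one report for the verification step."""
    return "\n".join(f"[{name}] {text}" for name, text in sorted(results.items()))


# Stub specialists for illustration only.
agents = {
    "coder": lambda task: f"code for: {task}",
    "reviewer": lambda task: "no blocking issues",
    "tester": lambda task: "3 passed",
}
report = aggregate(run_specialists("add login form", agents))
```

Threads suit this sketch because real specialist calls would mostly wait on network I/O; a sequential variant, where the reviewer and tester receive the coder's output, is an equally valid reading of the diagram.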
A harness that can modify itself — updating its own tool list, memory schema, or verification criteria based on what worked and what didn't. This is what the self-harness research explores and what Matt Pocock cautioned against when it applies to auto-generated CLAUDE.md instructions.
| | Custom Harness | Framework (LangChain, LangGraph) | Agent Platform (Claude Code, Devin) |
|---|---|---|---|
| What it is | Code you write from scratch | Library of harness components | Fully built harness with UI/CLI |
| Flexibility | Maximum | High (configurable) | Low (fixed patterns) |
| Time to first run | Days–weeks | Hours–days | Minutes |
| Best for | Unique verification logic, specific domains | Standard agentic patterns | Common dev tasks |
| Maintenance | Full ownership | Framework updates | Platform handles it |
For a fourth option that is minimal but extensible, see Pi (pi.dev). For an open-source option with 75+ providers and both terminal and desktop interfaces, see OpenCode.
The choice depends on how standard your task is. The more your task looks like "write code, run tests, fix until green," the more an existing platform handles it. The more you need custom verification, unusual tool combinations, or specific orchestration logic, the more you want a custom harness.
Clarifying the boundary:
A common misconception is that a good harness compensates for a weak model. It doesn't — it extracts more of what the model is capable of. There is a floor: if the model genuinely cannot solve the problem even with unlimited retries and perfect context, the harness cannot fix that.
If you are building a harness for the first time, the sequence that works:
Do not add planning layers, parallel execution, or multi-agent orchestration until the simple loop works reliably. Complexity in harnesses compounds — a subtle bug in a simple harness is easy to find; the same bug inside a planning layer inside a multi-agent system is not.