Agentic AI is the big thing right now, with names like OpenClaw and NemoClaw filling column inches as pundits tell you either to use them or to stay well away. Agents can organize your computer or delete your inbox, and Microsoft wants to put them into everything Windows.

The thing is, the capabilities of advanced LLMs like Claude are increasing faster than we can write use cases for them. It hasn't been six months since orchestrators for Claude and other LLMs started taking over GitHub, but that's a long time in frontier-model terms, and the LLMs have gotten markedly better since.

How much better? Well, a new paper tested designed orchestrators against self-organizing LLMs, and the creators of those GitHub projects aren't going to like the results. Or maybe they will, because it partly proves the value of pre-designed hierarchies, but only when the LLMs are able to self-organize within that structure. The TL;DR? LLM agents are much more capable than we thought and only need gentle coaxing to deliver their best results when given a problem.

The multi-agent architecture has a problem

When one output becomes an input, inaccuracies are multiplied

Building any system at scale is hard, because you have to ensure the validity of data no matter where it came from or what path it took. We all know AI agents can hallucinate, lie, make things up, or be inaccurate in part of their results while appearing to give worthwhile feedback.

With multi-agent orchestrators, that problem multiplies, because every inaccuracy compounds. How far? Well, Google DeepMind tested this in 2025, with 180 configurations across five agent architectures and three major LLMs. The result? Unstructured multi-agent networks amplify errors up to 17.2 times compared to single-agent baselines.

Seventeen times worse. At that point, you might as well put potential outcomes on a spinning dartboard and choose via a single dart thrown by a blindfolded player. You might get a more accurate answer.

The research also showed that performance gains didn't scale beyond four agents, as coordination overhead ate away any benefits. That stands in stark contrast to the industry players I know who use anywhere between 6 and 20 agents at a time to break down complex tasks. But even if those individual agents could get to 99% reliability, compound math is compound math, and that 1% is going to be an issue past one agent, let alone 20.
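
To put rough numbers on that compound math, here is a minimal Python sketch. It assumes the simplest possible model, which is mine and not the DeepMind study's: agents run in a serial chain, each one's output feeds the next, and failures are independent. The function name `chain_success` is hypothetical.

```python
def chain_success(per_agent_reliability: float, agents: int) -> float:
    """Chance that every agent in a serial chain gets its step right.

    Assumes independent failures, so the per-agent probabilities multiply.
    """
    return per_agent_reliability ** agents


# Compare a single agent with the team sizes mentioned above.
for n in (1, 4, 6, 20):
    print(f"{n:>2} agents at 99% each: {chain_success(0.99, n):.1%} end to end")
```

Under those assumptions, 20 agents at 99% each finish cleanly only about 82% of the time. Real systems can do better with verification steps or worse when errors feed on each other, so treat this as a floor on intuition rather than a prediction.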

Research lags behind practice

As we've seen recently with OpenClaw, it's faster to build something than it is to make it secure. AI models are now building themselves, and research on their interactions can't run until the models are released to researchers. It's the same lag when building tools for AI, in that you hope that the models won't get powerful enough to make your tools obsolete by the time you've released them. And with multi-agent tools, that time has come.

A recent paper challenges the multi-agent hierarchy

Self-organizing LLMs outperformed the many

Okay, so the paper in question is "Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures," and it's interesting not only for the results but also for how thoroughly the authors tested their hypothesis. They used 25,000 tasks spanning eight LLMs, with anywhere from four to 256 agents and eight coordination protocols. The results showed the best improvement with a hybrid approach, where the rough structure was mapped out but the individual agents were able to self-organize and choose their own roles.

The practical implication: give agents a mission, a protocol, and a capable model -- not a pre-assigned role.

Now, that doesn't mean there's no value in the coordinator model. Clearly, there is, otherwise the hybrid models wouldn't win out. It's similar to prompt engineering, but applied to scale. Giving autonomous agents a mission to achieve, a protocol to follow, and a relevant model to use is no different from prompting a chatbot, except it doesn't require any human intervention afterward.
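
As a rough illustration of what "a mission, a protocol, and a model" could look like in practice, here is a small Python sketch. Everything in it is hypothetical: the `AgentBrief` class, the `build_prompt` helper, the prompt wording, and the placeholder model name are my own assumptions, not the paper's actual prompts or protocols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentBrief:
    """Shared brief handed to every agent. Note there is no 'role' field."""
    mission: str   # what the team must achieve
    protocol: str  # how agents coordinate with each other
    model: str     # which LLM backs each agent (placeholder name here)


def build_prompt(brief: AgentBrief, agent_id: int, team_size: int) -> str:
    """Build one agent's system prompt from the shared brief."""
    return (
        f"You are agent {agent_id} of {team_size}.\n"
        f"Mission: {brief.mission}\n"
        f"Protocol: {brief.protocol}\n"
        "No role has been assigned to you. Read what the other agents "
        "have done and pick the contribution that is most useful."
    )


brief = AgentBrief(
    mission="Summarize the open bug reports and propose fixes.",
    protocol="Work in sequence; each agent sees all earlier outputs.",
    model="example-model",  # hypothetical identifier
)
prompts = [build_prompt(brief, i, team_size=4) for i in range(1, 5)]
print(prompts[0])
```

The point of the sketch is what's missing: every agent gets the same brief, and nothing says "you are the reviewer" or "you are the planner." A role-based orchestrator would add exactly that field.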

But the authors also surfaced plenty of other nuggets. Less capable models like GLM-5 worked better with rigid, assigned roles and an orchestrator hierarchy. Strong models like Claude Sonnet 4.6 and DeepSeek v3.2 performed best with minimal guidelines, and the open-source models reached 95% of the performance of the closed-source models, suggesting that costs can be reduced without sacrificing much quality.

Self-organizing agents are more accurate, for now

While the hybrid model is ahead right now, LLM research is evolving ever more rapidly. Single agents can create sub-agents and organize their own small workforce, and as models grow more capable, we may not need to define as much structure up front, only "how" each task should be attempted. It's impressive, but it's also fascinating to see how computers organize their workflows, compared to the organizational charts that humans have adjusted over the years.