Splitting work across a team of agents feels like the obvious upgrade. In practice every handoff is a new place to fail. Here is what the failure data actually shows, and when more agents genuinely pay off.

A single agent runs one chain. A team of agents adds a handoff at every seam, and the seams are where the work breaks.

Short answer. Multi-agent systems fail more often because every handoff between agents is a fresh chance for the work to go wrong, and most of those failures are coordination, not capability. The Berkeley MAST study read 1,642 real execution traces across seven popular frameworks and found failure rates of 41% to 86.7%. Because the failures come from coordination, fixing them takes more than swapping in a smarter model. Use one agent by default; reach for many only when the work is genuinely parallel.

Key facts.

Across 1,642 annotated traces from seven popular multi-agent frameworks, failure rates ran 41% to 86.7%, with one benchmark (AppWorld) failing 86.7% of the time (Cemri et al., MAST, 2025).
The failures cluster into three buckets: system design 41.8%, inter-agent misalignment 36.9%, and task verification 21.3%. The single most common mode is step repetition at 15.7%, followed by reasoning-action mismatch at 13.2% (MAST).
Multi-agent is not always worse. Anthropic's orchestrator-worker research system beat a single-agent Claude Opus 4 by 90.2% on their internal research eval, on breadth-first tasks that parallelize well (Anthropic Engineering, 2025).
That win is not free: the same system burned roughly 15 times the tokens of a chat, versus about 4 times for a single agent (Anthropic), and the Cognition team argues most multi-agent setups are fragile because context cannot be shared thoroughly enough between agents (Don't Build Multi-Agents, 2025).

Why does adding agents add failures?

Because every agent boundary is a new interface, and interfaces leak. When one agent hands work to another, it passes a compressed summary, not its full reasoning, and the receiving agent fills the gaps with its own assumptions. The Berkeley MAST analysis put numbers on this: 36.9% of failures were inter-agent misalignment, things like one agent ignoring another's input, derailing the task, or quietly resetting the conversation. Another 41.8% were system design issues such as agents disobeying the task specification or never recognizing the task was done. A single agent carries one continuous context and never has to re-explain itself to a teammate, so it skips this entire class of failure. You do not get coordination for free; you pay for it in new ways to break.
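To see how handoffs add up, here is a back-of-the-envelope sketch. It assumes every handoff succeeds independently and with the same probability, which real systems will not satisfy exactly. The 0.95 figure is invented for illustration and is not a number from MAST, and the function name `chain_success` is hypothetical.

```python
def chain_success(per_handoff_success: float, handoffs: int) -> float:
    """Chance that every handoff in a chain succeeds.

    Simplifying assumption: handoffs fail independently and all share
    the same success probability.
    """
    if not 0.0 <= per_handoff_success <= 1.0:
        raise ValueError("per_handoff_success must be between 0 and 1")
    if handoffs < 0:
        raise ValueError("handoffs must be non-negative")
    return per_handoff_success ** handoffs


if __name__ == "__main__":
    # 0.95 is an illustrative value, not measured data.
    for n in (0, 1, 3, 5, 10):
        print(f"{n:>2} handoffs: {chain_success(0.95, n):.3f}")
```

Under those assumptions, a handoff that works 95% of the time still leaves only about a 60% chance that a ten-handoff chain arrives intact. A single agent with zero handoffs never pays this particular tax.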

What actually goes wrong between agents?

The most common failure is mundane: step repetition, 15.7% of all cases, where agents redo work because no one owns the state. Close behind is reasoning-action mismatch at 13.2%, where an agent says one thing and does another, and the next agent trusts the words. Roles blur, so two agents solve the same piece and conflict, or each assumes the other handled it. Cognition's team illustrates it well: ask a swarm to build a Flappy Bird clone, and one sub-agent builds a Super Mario-style background because nothing pinned down the shared intent. The lead agent then has to reconcile work built on conflicting assumptions. None of this comes from a weak model. It is the cost of splitting one coherent task across minds that cannot see each other's full context.
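One way to picture "owning the state" is a shared ledger where an agent must claim a step before working on it. This is a minimal single-process sketch, not a method from MAST or Cognition. `StepLedger`, the step names, and the agent names are all hypothetical, and a real system would need an atomic, shared store rather than an in-memory dict.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StepLedger:
    """Records which agent owns each step and what it produced (hypothetical helper)."""

    owners: dict[str, str] = field(default_factory=dict)
    results: dict[str, str] = field(default_factory=dict)

    def claim(self, step: str, agent: str) -> bool:
        """Return True if `agent` owns `step` after this call, False if another agent does."""
        # setdefault stores `agent` only when the step has no owner yet.
        return self.owners.setdefault(step, agent) == agent

    def complete(self, step: str, agent: str, result: str) -> None:
        """Store a result, but only from the agent that owns the step."""
        if self.owners.get(step) != agent:
            raise PermissionError(f"{agent} does not own step {step!r}")
        self.results[step] = result

    def is_done(self, step: str) -> bool:
        return step in self.results


if __name__ == "__main__":
    ledger = StepLedger()
    print(ledger.claim("build_background", "agent_a"))  # first claim wins
    print(ledger.claim("build_background", "agent_b"))  # second agent is turned away
```

The second agent learns the step is taken before it starts, instead of finding out after both have produced conflicting versions. It does nothing for the shared-intent problem, which is about context rather than bookkeeping.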

The break happens at the seam. Across 1,642 MAST traces, most multi-agent failures are coordination at the handoff, not raw model capability.

So when do more agents actually help?

When the work is genuinely parallel and wider than one context window. Anthropic's research system is the clean example: a lead agent spawns several sub-agents that each explore an independent direction at the same time, then synthesizes. On breadth-first queries, like finding every board member across hundreds of companies, that beat a single agent by 90.2%, because the task decomposes into chunks that do not depend on each other. The tell is independence. If the subtasks can run without talking to each other and only meet at the end, multi-agent shines. If each step depends on the last, you are just adding handoffs to a chain that wanted to stay whole.
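The independence test and the fan-out can be sketched together. This is an assumption-laden illustration of the orchestrator-worker shape described above, not Anthropic's implementation: `are_independent`, `run_subagent`, and `lead_agent` are hypothetical names, and `run_subagent` is a stand-in where a real model call would go.

```python
import asyncio


def are_independent(dependencies: dict[str, set[str]]) -> bool:
    """True when no subtask lists another subtask as a prerequisite."""
    return all(not deps for deps in dependencies.values())


async def run_subagent(subtask: str) -> str:
    """Placeholder for a sub-agent; a real one would call a model here."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"findings for {subtask}"


async def lead_agent(dependencies: dict[str, set[str]]) -> dict[str, str]:
    """Fan out only when subtasks are independent, then collect the results."""
    if not are_independent(dependencies):
        raise ValueError("subtasks depend on each other; keep this a single-agent chain")
    subtasks = list(dependencies)
    # gather runs the sub-agents concurrently and returns results in input order.
    findings = await asyncio.gather(*(run_subagent(s) for s in subtasks))
    return dict(zip(subtasks, findings))


if __name__ == "__main__":
    parallel_work = {"company_a": set(), "company_b": set(), "company_c": set()}
    print(asyncio.run(lead_agent(parallel_work)))
```

A chain such as `{"draft": set(), "edit": {"draft"}}` fails the check, which is the signal to leave it with one agent rather than add a handoff between drafting and editing.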

Single agent or many? The honest comparison.

Dimension	Single agent	Multi-agent
Best for	Sequential, tightly coupled work	Parallel, independent subtasks
Failure surface	One chain, compounding errors	Every handoff adds coordination failures
Context	One continuous thread	Fragmented, re-summarized at each seam
Token cost	About 4x a chat	About 15x a chat; coordination is expensive
Right default	Yes, start here	Only when the work is truly parallel
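The token multipliers quoted from Anthropic (about 4x a chat for a single agent, about 15x for a multi-agent system) make the cost gap easy to rough out. The sketch below is plain arithmetic on those two figures. The 2,000-token chat baseline is an invented example, and `estimated_tokens` is a hypothetical helper.

```python
# Approximate multipliers relative to a plain chat, as reported by Anthropic.
CHAT_MULTIPLIER = {"chat": 1, "single_agent": 4, "multi_agent": 15}


def estimated_tokens(chat_tokens: int, mode: str) -> int:
    """Scale a chat-sized token budget by the rough multiplier for `mode`."""
    if mode not in CHAT_MULTIPLIER:
        raise ValueError(f"unknown mode: {mode!r}")
    return chat_tokens * CHAT_MULTIPLIER[mode]


if __name__ == "__main__":
    baseline = 2_000  # illustrative chat size, not a measured value
    single = estimated_tokens(baseline, "single_agent")
    multi = estimated_tokens(baseline, "multi_agent")
    print(single, multi, multi / single)
```

On these figures a multi-agent run costs close to four times what a single agent does for the same starting point, so the parallel speedup or quality gain has to be worth that much.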

The strategic move is to stop treating multi-agent as the default upgrade. One agent with the right tools and full context is the stronger baseline for most work, and you reach for a team only when the task genuinely splits into independent pieces. The hard part is knowing which of your workflows decompose cleanly and which only look like they do. That is a pattern question, not a framework question, and it is exactly the question OptimalARC builds the Pattern Intelligence Layer to answer: which parts of a workflow are safe to parallelize, and which must stay whole.

Frequently asked questions

Are multi-agent systems just worse than single agents?
No. They fail more on tightly coupled, sequential work because of coordination overhead, but they win on genuinely parallel tasks. Anthropic's research system beat a single agent by 90.2% on breadth-first queries. The question is whether your task decomposes into independent pieces.

What is the most common multi-agent failure?
Step repetition, 15.7% of cases in the MAST study, where agents redo work because no one clearly owns the state. Most multi-agent failures are coordination and verification problems, not weak models.

If multi-agent can score higher, why not always use it?
Cost and fragility. The same system that scored higher used roughly 15 times the tokens of a chat, and context cannot be shared fully between agents, so subtasks drift on conflicting assumptions. The gain only shows up when the work is truly parallel.

How do I decide between one agent and many?
Test for independence. If the subtasks can run without talking to each other and only combine at the end, multi-agent helps. If each step depends on the previous one, keep it a single agent and avoid the handoffs.
