A single agent runs one chain. A team of agents adds a handoff at every seam, and the seams are where the work breaks.
Short answer. Multi-agent systems fail more often because every handoff between agents is a fresh chance for the work to go wrong, and most of those failures are coordination, not capability. The Berkeley MAST study read 1,642 real execution traces across seven popular frameworks and found failure rates of 41% to 86.7%. Coordination failures are harder to fix than capability gaps, because swapping in a smarter model does not repair a bad handoff. Use one agent by default; reach for many only when the work is genuinely parallel.
Key facts.
- Across 1,642 annotated traces from seven popular multi-agent frameworks, failure rates ran 41% to 86.7%, with one benchmark (AppWorld) failing 86.7% of the time (Cemri et al., MAST, 2025).
- The failures cluster into three buckets: system design 41.8%, inter-agent misalignment 36.9%, and task verification 21.3%. The single most common mode is step repetition at 15.7%, followed by reasoning-action mismatch at 13.2% (MAST).
- Multi-agent is not always worse. Anthropic's orchestrator-worker research system beat a single-agent Claude Opus 4 by 90.2% on their internal research eval, on breadth-first tasks that parallelize well (Anthropic Engineering, 2025).
- That win is not free: the same system burned roughly 15 times the tokens of a chat, versus about 4 times for a single agent (Anthropic). The Cognition team also argues that most multi-agent setups are fragile because context cannot be shared thoroughly enough between agents (Don't Build Multi-Agents, 2025).
Why does adding agents add failures?
Because every agent boundary is a new interface, and interfaces leak. When one agent hands work to another, it passes a compressed summary, not its full reasoning, and the receiving agent fills the gaps with its own assumptions. The Berkeley MAST analysis put numbers on this: 36.9% of failures were inter-agent misalignment, things like one agent ignoring another's input, derailing the task, or quietly resetting the conversation. Another 41.8% were system design issues such as agents disobeying the task specification or never recognizing the task was done. A single agent carries one continuous context and never has to re-explain itself to a teammate, so it skips this entire class of failure. You do not get coordination for free; you pay for it in new ways to break.
What actually goes wrong between agents?
The most common failure is mundane: step repetition, 15.7% of all cases, where agents redo work because no one owns the state. Close behind is reasoning-action mismatch at 13.2%, where an agent says one thing and does another, and the next agent trusts the words. Roles blur, so two agents solve the same piece and conflict, or each assumes the other handled it. Cognition's team illustrates it well: ask a swarm to build a Flappy Bird clone, and one sub-agent builds a Super Mario style background because nothing pinned down the shared intent. The lead agent then has to reconcile work built on conflicting assumptions. None of this comes from a weak model. It is the cost of splitting one coherent task across minds that cannot see each other's full context.
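
One way to picture "no one owns the state" is a shared ledger where a step can be claimed only once. The sketch below is illustrative only: `TaskLedger`, `claim`, `complete`, and the agent and step names are hypothetical, not part of MAST or any framework named here, and the class is a single-process toy with no locking.

```python
from dataclasses import dataclass, field


@dataclass
class TaskLedger:
    """Hypothetical shared record of which agent owns which step."""

    owners: dict[str, str] = field(default_factory=dict)
    done: set[str] = field(default_factory=set)

    def claim(self, step: str, agent: str) -> bool:
        """Return True only if `agent` is the first to claim `step`."""
        if step in self.owners:
            return False  # someone already owns it, so do not redo the work
        self.owners[step] = agent
        return True

    def complete(self, step: str, agent: str) -> None:
        """Mark a step done; only the owning agent may do so."""
        if self.owners.get(step) != agent:
            raise PermissionError(f"{agent} does not own {step!r}")
        self.done.add(step)


ledger = TaskLedger()
first = ledger.claim("draw_background", "agent_a")   # agent_a takes the step
second = ledger.claim("draw_background", "agent_b")  # agent_b is turned away
ledger.complete("draw_background", "agent_a")
```

With explicit ownership, the second agent learns the step is taken instead of silently repeating it. This addresses only step repetition; it does nothing for mismatched intent between agents.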
The break happens at the seam. Across 1,642 MAST traces, most multi-agent failures are coordination at the handoff, not raw model capability.
So when do more agents actually help?
When the work is genuinely parallel and wider than one context window. Anthropic's research system is the clean example: a lead agent spawns several sub-agents that each explore an independent direction at the same time, then synthesizes. On breadth-first queries, like finding every board member across hundreds of companies, that system beat a single agent by 90.2%, because the task decomposes into chunks that do not depend on each other. The tell is independence. If the subtasks can run without talking to each other and only meet at the end, multi-agent shines. If each step depends on the last, you are just adding handoffs to a chain that wanted to stay whole.
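
The independence test can be sketched as a check on a dependency map. This is a minimal illustration, assuming you can write down which subtasks need another subtask's output; `can_fan_out` and both example workloads are hypothetical names made up for this sketch.

```python
def can_fan_out(depends_on: dict[str, set[str]]) -> bool:
    """True if no subtask depends on another subtask in the same batch."""
    subtasks = set(depends_on)
    return all(not (deps & subtasks) for deps in depends_on.values())


# Breadth-first lookup: each company is researched on its own.
board_lookup = {"company_a": set(), "company_b": set(), "company_c": set()}

# Sequential chain: each step needs the previous step's output.
pipeline = {"parse": set(), "plan": {"parse"}, "execute": {"plan"}}

parallel_ok = can_fan_out(board_lookup)  # True: subtasks only meet at the end
chain_ok = can_fan_out(pipeline)         # False: keep this with one agent
```

Real workflows are rarely this tidy, since dependencies are often implicit in shared intent rather than in data. A check like this catches only the dependencies you have written down.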
Single agent or many? The honest comparison.
| Dimension | Single agent | Multi-agent |
|---|---|---|
| Best for | Sequential, tightly-coupled work | Parallel, independent subtasks |
| Failure surface | One chain, compounding errors | Every handoff adds coordination failures |
| Context | One continuous thread | Fragmented, re-summarized at each seam |
| Cost | About 4x a chat in tokens | Around 15x; coordination is expensive |
| Right default | Yes, start here | Only when the work is truly parallel |
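
The cost row can be turned into rough arithmetic. The sketch below uses the approximate multipliers cited above (about 4x a chat for a single agent, around 15x for multi-agent); the 10,000-token chat baseline and the function name `estimate_tokens` are assumptions chosen only for illustration.

```python
# Approximate multipliers relative to a plain chat, as cited in this article.
SINGLE_AGENT_MULTIPLIER = 4
MULTI_AGENT_MULTIPLIER = 15


def estimate_tokens(chat_tokens: int) -> dict[str, int]:
    """Scale a chat-sized token budget by each setup's rough multiplier."""
    return {
        "chat": chat_tokens,
        "single_agent": chat_tokens * SINGLE_AGENT_MULTIPLIER,
        "multi_agent": chat_tokens * MULTI_AGENT_MULTIPLIER,
    }


# Hypothetical baseline: a task that would cost 10,000 tokens as a chat.
estimate = estimate_tokens(10_000)

# How much more the multi-agent setup costs than a single agent.
ratio = estimate["multi_agent"] / estimate["single_agent"]
```

Under these assumptions the multi-agent run costs 150,000 tokens against 40,000 for a single agent, a ratio of 3.75. The parallel speedup has to be worth nearly four times the spend.
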
The strategic move is to stop treating multi-agent as the default upgrade. One agent with the right tools and full context is the stronger baseline for most work, and you reach for a team only when the task genuinely splits into independent pieces. The hard part is knowing which of your workflows decompose cleanly and which only look like they do. That is a pattern question, not a framework question, and it is exactly what OptimalARC builds as the Pattern Intelligence Layer: knowing which parts of a workflow are safe to parallelize, and which must stay whole.
Frequently asked questions
Are multi-agent systems just worse than single agents?
No. They fail more on tightly-coupled, sequential work because of coordination overhead, but they win on genuinely parallel tasks. Anthropic's research system beat a single agent by 90.2% on breadth-first queries. The question is whether your task decomposes into independent pieces.
What is the most common multi-agent failure?
Step repetition, 15.7% of cases in the MAST study, where agents redo work because no one clearly owns the state. Most multi-agent failures are coordination and verification problems, not weak models.
If multi-agent can score higher, why not always use it?
Cost and fragility. The same system that scored higher used roughly 15 times the tokens of a chat, and context cannot be shared fully between agents, so subtasks drift on conflicting assumptions. The gain only shows up when the work is truly parallel.
How do I decide between one agent and many?
Test for independence. If the subtasks can run without talking to each other and only combine at the end, multi-agent helps. If each step depends on the previous one, keep it a single agent and avoid the handoffs.

