The default 2026 instinct is to break every AI workflow into specialised agents: a planner, a retriever, a drafter, a critic, an editor. The diagram looks impressive. The system that ships is usually slower, more expensive, and harder to evaluate than the single-agent version would have been. Multi-agent is a tool, not a default.
The default trap.
A single-agent system has one prompt, one model, one set of tools. It is easy to evaluate, easy to debug, and predictable in cost. A multi-agent system has handoffs, intermediate state, multiple model calls, and several layers of failure to attribute. The split is justified only when the workflow actually benefits from specialisation that one agent cannot provide. Most workflows do not.
When the split pays off.
- Specialised reasoning patterns. A planning step needs cold, methodical reasoning; a drafting step needs creative latitude. A reasoning-tuned planner paired with a creative drafter can beat a single agent told to do both.
- Different tool sets per role. The agent that queries the CRM should not have access to the email send tool. Role-scoping the tools is a safety mechanism.
- Verification independent from generation. A critic agent that did not produce the draft can catch errors the drafter is blind to. The independence is the whole point.
- Parallel work. Three retrieval queries run in parallel return faster than three sequential calls. The split is for concurrency, not for specialisation.
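Role-scoping can be as plain as an allowlist checked before every tool call. The sketch below is a minimal illustration, not any framework's API: the role names, the tool names, and the `call_tool` helper are all hypothetical, and the two tools are stubs standing in for a real CRM query and a real mail service.

```python
from typing import Callable, Dict, FrozenSet

# Hypothetical stub tools; real ones would call a CRM API or a mail service.
def query_crm(customer_id: str) -> str:
    return f"record for {customer_id}"

def send_email(to: str, body: str) -> str:
    return f"sent to {to}"

TOOLS: Dict[str, Callable[..., str]] = {
    "query_crm": query_crm,
    "send_email": send_email,
}

# Each role gets an explicit allowlist; anything not listed is denied.
ROLE_TOOLS: Dict[str, FrozenSet[str]] = {
    "crm_researcher": frozenset({"query_crm"}),
    "outreach_sender": frozenset({"send_email"}),
}

class ToolNotAllowed(PermissionError):
    """Raised when a role calls a tool outside its allowlist."""

def call_tool(role: str, tool_name: str, *args: str) -> str:
    # Unknown roles get an empty allowlist, so they are denied by default.
    allowed = ROLE_TOOLS.get(role, frozenset())
    if tool_name not in allowed:
        raise ToolNotAllowed(f"{role!r} may not call {tool_name!r}")
    return TOOLS[tool_name](*args)
```

The check lives in the dispatcher, not in the prompt, so a confused or manipulated agent cannot talk its way into a tool it was never given.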
Three orchestration patterns we see.
- Pipeline. A sequence of agents passes structured state along: the output of step one is the input to step two. Predictable, debuggable, the right shape for most workflows.
- Supervisor and workers. A supervisor agent plans and delegates to one of several specialist workers, then integrates their outputs. The shape behind many customer-support and research workflows.
- Group chat. Several agents share a conversation and decide when to speak. Powerful for open-ended problem solving, hard to bound in cost and latency, rarely the right shape for a production workflow.
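The pipeline pattern, the first of the three above, reduces to a list of functions over one shared state object. A minimal sketch, assuming each agent can be modelled as a function from state to state; the `State` fields and the `retrieve` and `draft` stand-ins are hypothetical and make no model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class State:
    request: str                                     # original request, never rewritten
    notes: List[str] = field(default_factory=list)   # filled by retrieval
    draft: str = ""                                  # filled by drafting

Step = Callable[[State], State]

def retrieve(state: State) -> State:
    # Stand-in for a retrieval agent.
    state.notes.append(f"fact about {state.request}")
    return state

def draft(state: State) -> State:
    # Stand-in for a drafting agent that reads the retrieved notes.
    state.draft = f"Answer to '{state.request}' using {len(state.notes)} note(s)"
    return state

def run_pipeline(state: State, steps: List[Step]) -> State:
    # Each step receives the state the previous step returned.
    for step in steps:
        state = step(state)
    return state
```

Calling `run_pipeline(State("refund policy"), [retrieve, draft])` runs the two steps in order. Because every step has the same signature, any one of them can be tested or swapped without touching the rest.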
Handoff fidelity is the real problem.
The headline failure mode of multi-agent systems is information loss at the handoff. Agent A summarises five documents into a four-sentence brief for Agent B. The brief is plausible but drops the constraint that mattered. Agent B drafts confidently against the missing piece. The drafter looks wrong; the failure was upstream.
Practical defenses: structure the handoff (typed JSON, not free-form prose), include the source artefacts (Agent B can re-read what Agent A read), and version the handoff schema so any change is a deliberate breaking event.
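Those three defenses fit in one small data structure. The sketch below is one possible shape, not a standard: the field names, the `SCHEMA_VERSION` value, and the `encode` and `decode` helpers are assumptions made for illustration. It checks the version and the set of fields, but not the type of each value; a validation library would be the usual next step.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

SCHEMA_VERSION = 2  # hypothetical; bump it on any change to the fields below

@dataclass(frozen=True)
class Handoff:
    schema_version: int
    request: str            # the original request, verbatim
    constraints: List[str]  # explicit constraints, not buried in the summary
    summary: str            # Agent A's brief
    source_ids: List[str]   # references that let Agent B re-read the sources

def encode(handoff: Handoff) -> str:
    return json.dumps(asdict(handoff))

def decode(raw: str) -> Handoff:
    data = json.loads(raw)
    if data.get("schema_version") != SCHEMA_VERSION:
        raise ValueError(
            f"handoff schema {data.get('schema_version')!r}, expected {SCHEMA_VERSION}"
        )
    # Missing or unexpected fields raise TypeError here instead of passing silently.
    return Handoff(**data)
```

Keeping `constraints` as its own field means a dropped constraint shows up as a missing list entry that a check can catch, not as a sentence that quietly fell out of the prose.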
Multi-agent failure modes.
- Specialisation theatre. Five “agents” calling the same model with five slightly different prompts. The split adds latency without changing capability.
- Critic complicity. A critic agent given too few instructions agrees with everything the drafter produces. The critic is decoration, not validation.
- State drift. Each agent rewrites or paraphrases the state. After three hops the state has lost the original request. Pin the source artefact; forbid silent rewriting.
- Loop pathology. An agent decides the work is not done and re-invokes itself or the previous agent. Without an iteration cap, the system keeps spending without converging.
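An iteration cap for the loop pathology takes only a few lines. A minimal sketch, assuming the drafter and the critic can each be called as a plain function; `revise_until_accepted` and its parameters are hypothetical names, not part of any framework.

```python
from typing import Callable, Tuple

def revise_until_accepted(
    draft: str,
    revise: Callable[[str], str],   # stand-in for the drafter agent
    accept: Callable[[str], bool],  # stand-in for the critic agent
    max_iterations: int = 3,
) -> Tuple[str, bool, int]:
    """Return (final_draft, accepted, revisions_used)."""
    revisions = 0
    while not accept(draft):
        if revisions >= max_iterations:
            # Cap reached: stop and report failure instead of spending more.
            return draft, False, revisions
        draft = revise(draft)
        revisions += 1
    return draft, True, revisions
```

Returning the `accepted` flag instead of raising lets the caller decide what an unconverged draft means: escalate to a human, fall back to a simpler path, or fail the request.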
Evaluating the flow, not just the agents.
Per-agent rubrics are necessary but not sufficient. The flow has its own properties:
- End-to-end coverage. Did the full pipeline produce an answer that meets the original request, not just the last agent's scoped output?
- Handoff faithfulness. Does the brief passed between agents preserve the constraints from the source?
- Latency budget. Production multi-agent workflows should have a wall-clock budget; the rubric checks each run against it.
- Cost budget. Same shape, applied to the cumulative model spend per request.
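The latency and cost budgets can be checked mechanically from per-call records. The sketch below is an illustration under stated assumptions: the `CallRecord` and `FlowBudget` types are hypothetical, and latency is summed as if the agents ran one after another. A flow with parallel branches would need measured wall-clock time instead.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CallRecord:
    agent: str
    latency_s: float  # seconds spent in this model call
    cost_usd: float   # spend attributed to this model call

@dataclass
class FlowBudget:
    max_latency_s: float
    max_cost_usd: float

def check_budgets(calls: List[CallRecord], budget: FlowBudget) -> List[str]:
    """Return the names of violated budgets; an empty list means the run passed."""
    # Assumes sequential calls, so total latency is the sum of the parts.
    total_latency = sum(c.latency_s for c in calls)
    total_cost = sum(c.cost_usd for c in calls)
    violations: List[str] = []
    if total_latency > budget.max_latency_s:
        violations.append("latency")
    if total_cost > budget.max_cost_usd:
        violations.append("cost")
    return violations
```

Running this over every request in the eval set turns both budgets into pass/fail columns on the same scoreboard as the quality rubrics.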
Start single-agent. Split only when the eval scoreboard shows a specific deficit that specialisation actually fixes. A diagram that looks sophisticated is not evidence of a workflow that works.
Common questions.
Should we use an agent framework? Frameworks accelerate the boilerplate (state passing, tool registration, observability hooks) but they do not decide the orchestration. The architecture decisions are yours; the framework just runs the loop.
How many agents is too many? Past three or four coordinated agents in one workflow, attributing failures takes more effort than the system is worth. Compose smaller multi-agent units into a pipeline rather than building one large committee.
Where does observability fit? Every handoff is a trace boundary. See the AI observability glossary entry for the layer that records prompts, retrievals, tool calls, and outputs across the flow.
Related glossary entries: multi-agent workflow, function calling, chain-of-thought.



