Multi-agent systems in document AI

Why one agent stops scaling at the third edge case

A single ReAct loop is a beautiful thing on the demo bench. It reads a document, decides on a tool, calls it, reads the result, decides on the next tool, and emits a structured answer. Production breaks it for one boring reason: the prompt that handles the happy path is not the prompt that handles edge cases, and pasting them all into one context turns the agent into a generalist that is mediocre at every step.

The first symptom shows up around the third edge case. The team adds a clause to the prompt to handle scanned documents with rotated stamps. A week later, another clause for multi-page contracts where the signature page is at the back. A week after that, a clause for invoices where the buyer is a sub-entity of the legal counterparty. By the fourth round the prompt is 6,000 tokens of caveats, the agent is reasoning about whether to apply rule 11 or rule 14 before it even looks at the page, and accuracy on the original happy path quietly drops by two points because the model is distracted.

The second symptom shows up in the audit log. Every action — read page, call OCR fallback, score confidence, route to human — is a step in the same agent's trace. When the auditor asks "who decided this case did not need review?", the answer is "the agent did, in step 14 of a 23-step trace where steps 1–9 were extraction and step 14 was the routing decision." That is not an answer that survives the conversation. Each decision the operator cares about wants its own owner.

Multi-agent design fixes both. Specialization keeps each prompt narrow and the per-step accuracy high. Decomposition gives each decision its own accountable owner, which is what the audit trail and the chief AI officer's (CAIO) operating model actually want.

The three orchestration patterns that survived 2025

A year ago the multi-agent literature read like a catalogue of every plausible architecture. By the end of 2025, three patterns had absorbed the bulk of production deployments. The names are not new — they have been in the distributed-systems and AI textbooks for decades — but their shapes are now concrete enough that engineering teams pick between them on the same whiteboard.

Hierarchical — the maestro and the specialists

One maestro agent owns the case end-to-end and delegates each step to a specialist sub-agent. The classifier decides the document type, the extractor pulls fields, the validator runs the rules, the escalator opens a human task when a threshold is crossed, the auditor stamps the trace. The maestro never does the work itself; it picks the next specialist, hands over the relevant slice of context, and merges the results back into the case state.

This is the pattern that pays back fastest in document AI. The maestro is the only agent that needs the full case context, which keeps token cost flat per step instead of compounding. The specialists are small, cheap and easy to evaluate in isolation. When a new edge case appears, you add a specialist instead of rewriting one giant prompt. The trade-off is that the maestro becomes the single point of failure: if it picks the wrong specialist or passes the wrong context, the case derails and nothing downstream can recover.
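A minimal sketch of that delegation loop, under loud assumptions: the specialists here are toy functions rather than model calls, and the names Case, PLAN, SPECIALISTS and run_maestro are hypothetical, not a Cogneris API. The point it illustrates is that only the maestro holds the full case state and each specialist sees just the slice it needs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Case:
    state: Dict[str, object]                        # full case context, held only by the maestro
    trace: List[str] = field(default_factory=list)  # which specialist ran, in order


def classify(ctx: dict) -> dict:
    return {"doc_type": "invoice" if "invoice" in ctx["text"].lower() else "unknown"}


def extract(ctx: dict) -> dict:
    if ctx["doc_type"] != "invoice":
        return {"fields": {}}
    words = ctx["text"].lower().split()
    # Toy rule (assumption): the token after the word "total" is the amount.
    if "total" in words[:-1]:
        return {"fields": {"total": float(words[words.index("total") + 1])}}
    return {"fields": {}}


def validate(ctx: dict) -> dict:
    total = ctx["fields"].get("total")
    return {"needs_human": total is None or total <= 0}


# Each plan step names a specialist and the slice of case state it may see.
PLAN: List[Tuple[str, List[str]]] = [
    ("classifier", ["text"]),
    ("extractor", ["text", "doc_type"]),
    ("validator", ["fields"]),
]
SPECIALISTS: Dict[str, Callable[[dict], dict]] = {
    "classifier": classify,
    "extractor": extract,
    "validator": validate,
}


def run_maestro(case: Case) -> Case:
    for name, keys in PLAN:
        context_slice = {k: case.state[k] for k in keys}     # hand over only what is needed
        case.state.update(SPECIALISTS[name](context_slice))  # merge the result back
        case.trace.append(name)
    return case
```

Adding a specialist in this shape means one new function and one new plan entry; no existing specialist changes. A real maestro would choose the next step dynamically instead of walking a fixed plan.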

Peer-to-peer — the negotiation pattern

No central coordinator. Agents talk to each other directly through a message bus, claim work, and negotiate handoffs. The extractor finishes a page and posts "fields ready for invoice X"; the validator picks the message up, runs the rule pack, and posts either "validated" or "needs reconciliation with PO." Anyone subscribed to the relevant topic acts on it. The agents are peers and the choreography lives in the messages, not in a manager.

The pattern wins when the workflow has true parallelism and the agents rarely need to look at each other's state, for example multi-document reasoning where two extractors run on two PDFs at the same time and a reconciler picks up both outputs once they land. It loses when the steps are mostly sequential and the negotiation overhead dominates: a 3-step pipeline does not benefit from a Kafka topic between every stage.
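To make the choreography concrete, here is a small in-memory sketch. It assumes a toy Bus class standing in for a real message broker, and the topic names (page.received, fields.ready, validated, needs.reconciliation) are invented for illustration. No agent calls another directly; each one reacts to a topic and posts to the next.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple


class Bus:
    """Toy in-memory stand-in for a message bus (assumption, not a real broker)."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Callable]] = defaultdict(list)
        self.queue: Deque[Tuple[str, dict]] = deque()
        self.log: List[str] = []  # topics in delivery order

    def subscribe(self, topic: str, handler: Callable) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, payload: dict) -> None:
        self.queue.append((topic, payload))

    def drain(self) -> None:
        while self.queue:
            topic, payload = self.queue.popleft()
            self.log.append(topic)
            for handler in self.subscribers[topic]:
                handler(self, payload)


def extractor(bus: Bus, msg: dict) -> None:
    # Posts "fields ready" for whoever is subscribed; it does not know the validator exists.
    bus.publish("fields.ready", msg)


def validator(bus: Bus, msg: dict) -> None:
    topic = "validated" if msg["total"] == msg["po_total"] else "needs.reconciliation"
    bus.publish(topic, {"invoice": msg["invoice"]})


def demo() -> List[str]:
    bus = Bus()
    bus.subscribe("page.received", extractor)
    bus.subscribe("fields.ready", validator)
    bus.publish("page.received", {"invoice": "X", "total": 100, "po_total": 100})
    bus.publish("page.received", {"invoice": "Y", "total": 90, "po_total": 100})
    bus.drain()
    return bus.log
```

The sketch delivers messages one at a time for readability; with a real broker the two extractions would run concurrently, which is where the pattern earns its overhead.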

Blackboard — the shared scratchpad

A shared workspace — the blackboard — holds the evolving case. Every specialist reads from it, posts its contribution back, and a controller decides what runs next based on what is on the board. The blackboard is the single source of truth; the agents are stateless workers around it.

This is the pattern that fits regulated, multi-source workflows best. The blackboard captures the full case state, the order of contributions, and who wrote what and when, which becomes the audit artifact almost for free. Three-way match between PO, invoice and goods receipt is the textbook example: the matching specialist needs all three on the board before it can run, and the audit needs to show which agent put each piece there. The cost is operational: the blackboard is a stateful service the team has to run, version and back up. Most early implementations under-invest in it, then pay for that under-investment when volume grows past a few thousand cases a day.
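The three-way match can be sketched as follows. This is an illustration under stated assumptions: the Blackboard class, the controller loop and the agent names are hypothetical, the board lives in memory instead of in a durable service, and provenance is a sequence number where a real system would also store timestamps.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Blackboard:
    entries: Dict[str, dict] = field(default_factory=dict)
    history: List[Tuple[int, str, str]] = field(default_factory=list)  # (sequence, agent_id, key)

    def post(self, agent_id: str, key: str, value: dict) -> None:
        self.entries[key] = value
        self.history.append((len(self.history) + 1, agent_id, key))


def three_way_match(board: Blackboard) -> dict:
    po, inv, gr = (board.entries[k] for k in ("po", "invoice", "goods_receipt"))
    ok = po["qty"] == inv["qty"] == gr["qty"] and po["amount"] == inv["amount"]
    return {"ok": ok}


# (agent_id, keys it needs on the board, key it produces, function)
SPECIALISTS = [("matcher", ("po", "invoice", "goods_receipt"), "match", three_way_match)]


def controller(board: Blackboard) -> List[str]:
    """Run every specialist whose inputs are on the board and whose output is not."""
    ran: List[str] = []
    progressed = True
    while progressed:
        progressed = False
        for agent_id, needs, produces, fn in SPECIALISTS:
            ready = all(key in board.entries for key in needs)
            if ready and produces not in board.entries:
                board.post(agent_id, produces, fn(board))
                ran.append(agent_id)
                progressed = True
    return ran
```

Inputs can land in any order. The matcher stays idle until all three are present, and the history list records which agent posted each piece.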

Which pattern fits which workflow

Picking the wrong pattern is the most expensive mistake we see in the first six months of an agentic program. The decision is usually obvious once the workflow shape is named honestly, but most teams pick the pattern that sounded good in a talk and then bend the workflow to fit it.

Workflow shape	Pattern that fits	What breaks if you pick wrong
Sequential extraction with branching by document type	Hierarchical — maestro routes per case.	Peer-to-peer adds message-bus latency for no gain; blackboard adds persistence cost the case never needs.
Multi-source reasoning with parallel inputs	Blackboard — agents wait on shared state, contribute in any order.	Hierarchical forces the maestro to wait on every input; peer-to-peer loses provenance the auditor will ask for.
Streaming intake with independent cases	Peer-to-peer — each case is an event, agents claim and process.	Hierarchical bottlenecks on the maestro; blackboard turns the bus into a database.
Regulated multi-stage extraction with separate auditable artifacts	Blackboard — every specialist's contribution is an artifact on the board.	Hierarchical buries the per-stage artifact inside the maestro's trace; peer-to-peer scatters it across topics.
Short pipelines with 3–4 known steps	Hierarchical, kept boring.	Either of the others is over-engineering. We've seen teams ship a Kafka cluster to coordinate three agents that should have been a switch statement.

The honest rule we use internally: start hierarchical. Move a step to peer-to-peer when you have evidence that the step has real parallelism. Move the case to a blackboard when the audit shape demands per-contribution provenance the maestro's trace can't carry cleanly.

The five failure modes that show up in production

The first multi-agent system in production almost never fails on quality. It fails on coordination. The five failure modes below are the ones we have seen recur across teams; none of them are exotic, and all of them have a defensive pattern that ships in the same week as the agents themselves.

Loops

Two agents send the case back and forth without making progress. The extractor flags "low confidence on field X", the validator routes the case back to the extractor with "please retry", and the extractor produces the same low-confidence value because nothing about the input changed. The case bounces until a watchdog kills it or the user notices the queue is full.

Defence: every handoff carries a hop counter and a reason code. The maestro (or the blackboard controller) refuses any handoff where the same agent pair has touched the case more than N times without the input changing. The case moves to human review with the loop trace attached.
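One way to sketch that guard, assuming the case input can be serialized to JSON and fingerprinted with a hash. Handoff, LoopGuard and the reason codes are hypothetical names, and max_repeats plays the role of N.

```python
import hashlib
import json
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Handoff:
    case_id: str
    sender: str
    receiver: str
    reason_code: str
    hop: int  # carried along so the loop trace can be attached to the review task


class LoopGuard:
    def __init__(self, max_repeats: int = 2) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def allow(self, handoff: Handoff, case_input: dict) -> bool:
        # Fingerprint the input so a retry on changed input is not counted as a loop.
        digest = hashlib.sha256(
            json.dumps(case_input, sort_keys=True).encode()
        ).hexdigest()
        pair = frozenset((handoff.sender, handoff.receiver))  # direction-agnostic agent pair
        key = (handoff.case_id, pair, digest)
        self._seen[key] += 1
        return self._seen[key] <= self.max_repeats
```

When allow returns False, the orchestrator would stop dispatching and open a human review task instead.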

Deadlocks

Two agents wait for each other. The reconciliation specialist is waiting for the validator to confirm field A; the validator is waiting for the reconciliation specialist to resolve a discrepancy on field B. Neither will act until the other does, and the case sits at zero progress.

Defence: every agent declares its dependencies up front, and the orchestrator runs a cycle check before scheduling. Where dependencies cannot be declared statically, the blackboard pattern helps: the controller sees that no agent has acted in the last t seconds and routes the case to human resolution rather than letting it rot.
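Both defences fit in a few lines. The sketch below uses the standard-library graphlib module (Python 3.9+) for the cycle check; the schedule and is_stalled helpers and the agent names are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def schedule(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return a run order, or fail before scheduling if the agents wait on each other.

    `dependencies` maps each agent to the set of agents it waits for.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        cycle = " -> ".join(err.args[1])  # graphlib reports the offending cycle here
        raise ValueError(f"dependency cycle: {cycle}") from err


def is_stalled(last_action_at: float, now: float, timeout_s: float) -> bool:
    """Watchdog for the dynamic case: no agent has acted within the timeout."""
    return now - last_action_at > timeout_s
```

The static check catches cycles at deploy time; the timeout covers dependencies that only appear at runtime.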

Poisoned context

One specialist passes a contaminated payload to the next. The classifier labels the document as "freight invoice" when it is actually a credit note; the extractor reads it with the freight schema and produces a confident but wrong set of fields; the validator's rules pass because the schema is internally consistent. Each agent did its job correctly given its input; the cascade is broken because step one was wrong.

Defence: every specialist re-validates the slice of context it depends on, even if it trusts the previous one. The extractor re-checks the classifier's verdict against two lightweight features (header keyword, total sign) before applying the schema. Cheap; almost free in tokens; catches about 80% of the cascades.
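A sketch of that re-check, using the two features named above. The labels, keywords and sign rules in SANITY_CHECKS are illustrative assumptions; a real rule set would come from the tenant's document types.

```python
from typing import Callable, Dict

# label -> predicate over (lower-cased header text, document total)
SANITY_CHECKS: Dict[str, Callable[[str, float], bool]] = {
    "freight_invoice": lambda header, total: "invoice" in header and total >= 0,
    "credit_note": lambda header, total: "credit" in header and total <= 0,
}


def label_is_plausible(label: str, header: str, total: float) -> bool:
    """Cheap, model-free check the extractor runs before trusting the classifier."""
    check = SANITY_CHECKS.get(label)
    return check is not None and check(header.lower(), total)
```

If the check fails, the extractor would hand the case back for reclassification or human review rather than apply the schema.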

Schema drift between agents

Specialists evolve at different speeds. The extractor's output schema gains a field the validator does not yet know about; the validator silently drops it; downstream consumers see a regression they cannot trace because every individual agent passed its own tests.

Defence: a versioned contract between agents, enforced by a typed message envelope. The maestro (or the blackboard) rejects any message that does not match the schema version both ends agreed on. This is the engineering discipline most teams skip in the first quarter and then retrofit painfully in the second.
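A minimal typed envelope might look like the sketch below. Envelope, CONTRACTS, ContractError and the field names are hypothetical; the design choice it illustrates is that a mismatch is rejected loudly instead of the unknown field being dropped silently.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set, Tuple


class ContractError(Exception):
    pass


@dataclass(frozen=True)
class Envelope:
    sender: str
    schema: str
    version: int
    payload: dict


# (schema name, version) -> exact set of payload fields both ends agreed on
CONTRACTS: Dict[Tuple[str, int], FrozenSet[str]] = {
    ("extractor.fields", 1): frozenset({"invoice_id", "total"}),
    ("extractor.fields", 2): frozenset({"invoice_id", "total", "currency"}),
}


def accept(env: Envelope, supported_versions: Set[int]) -> dict:
    """Return the payload only if version and fields match the agreed contract."""
    if env.version not in supported_versions:
        raise ContractError(f"{env.schema} v{env.version} not supported by receiver")
    expected = CONTRACTS.get((env.schema, env.version))
    if expected is None:
        raise ContractError(f"no contract registered for {env.schema} v{env.version}")
    actual = set(env.payload)
    if actual != expected:
        raise ContractError(
            f"field mismatch: extra={sorted(actual - expected)} "
            f"missing={sorted(expected - actual)}"
        )
    return env.payload
```

In this shape, an extractor that starts emitting a new field has to bump the version first, and a validator that has not caught up fails at the envelope rather than in a downstream report.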

Cost runaway

Five agents on every case sounds harmless until the volume scales. A case that used to be one 3,000-token call is now five 1,500-token calls, the maestro adds its own 2,000 tokens of routing context per step, and the per-case cost is 3–4x what the single-agent baseline was. The throughput gain is real, but the unit economics quietly stop working.

Defence: a budget envelope per case, enforced by the maestro. Each specialist gets a token allowance; the maestro tracks the running total and refuses to dispatch the next agent if the next call would exceed the cap. Cases that approach the cap route to human review by design. The cap is the same lever we covered in our reasoning-model routing post — the principle is identical, the implementation just lives one level higher in the stack.
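The enforcement logic is small. The sketch below assumes each step carries a token estimate known before dispatch; run_with_budget, the step names and the numbers are illustrative only.

```python
from typing import List, Tuple


def run_with_budget(
    steps: List[Tuple[str, int]], cap_tokens: int
) -> Tuple[List[str], str, int]:
    """Dispatch steps in order until the next one would exceed the per-case cap.

    `steps` is a list of (specialist name, estimated tokens for that call).
    Returns (dispatched specialists, outcome, tokens spent).
    """
    spent = 0
    dispatched: List[str] = []
    for name, estimate in steps:
        if spent + estimate > cap_tokens:
            # Refuse the dispatch and route to a human by design.
            return dispatched, "human_review", spent
        spent += estimate  # a real maestro would record actual usage after the call
        dispatched.append(name)
    return dispatched, "completed", spent
```

For example, with a 4,000-token cap the fourth call below is refused before it is made:

```python
steps = [("classifier", 500), ("extractor", 1500), ("validator", 1500), ("reconciler", 1500)]
run_with_budget(steps, cap_tokens=4000)
# (['classifier', 'extractor', 'validator'], 'human_review', 3500)
```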

Observability without the black box

The hardest thing about multi-agent systems is not making them work — it is being able to explain what they did, six weeks later, to a person who was not in the room. Every team we have talked to that scaled past 50,000 cases a month converged on the same observability shape, with three layers the platform has to emit by default.

Case-level trace, not agent-level trace

The unit of observation is the case, not the agent. The trace shows the case moving through specialists, with each specialist's input, output, model version, prompt version, latency and cost. The auditor reads it as a story: "this invoice arrived, classifier said freight, extractor pulled nine fields, validator flagged one, reconciler matched against PO 4471, case auto-approved." If the trace reads as a story, the audit conversation is a discussion. If it reads as a log dump, the conversation becomes an investigation.

Per-specialist evaluation as a first-class artifact

Specialists are evaluated independently against their own gold set. The extractor's accuracy on invoice schema, the classifier's confusion matrix, the validator's false-positive rate on rule R12 — each lives on a dashboard the owner of that specialist looks at every week. The mistake teams make is to evaluate only the case-level outcome, because the case-level number hides which specialist regressed.

Disagreement detection

Specialists overlap on purpose. The classifier emits a label; the extractor's schema choice implies a label; if the two disagree, the case goes to a second-opinion specialist or a human. This is the cheapest quality signal in multi-agent systems and the one most often skipped: the agents are designed to cooperate, but a tiny adversarial overlap surfaces more regressions than a quarterly evaluation does.
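As a sketch, the overlap check is a lookup and a comparison. The schema names and the SCHEMA_IMPLIES mapping below are hypothetical; the point is that the signal costs no extra model call.

```python
from typing import Dict

# Which document label each extraction schema implies (illustrative values).
SCHEMA_IMPLIES: Dict[str, str] = {
    "freight_v3": "freight_invoice",
    "credit_v1": "credit_note",
}


def route_on_disagreement(classifier_label: str, extractor_schema: str) -> str:
    """Continue when both specialists agree; otherwise ask for a second opinion."""
    implied = SCHEMA_IMPLIES.get(extractor_schema)
    return "continue" if implied == classifier_label else "second_opinion"
```

An unknown schema also routes to a second opinion, which is the conservative default for a check like this.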

We covered the wire format we use for case-level traces in our agentic tracing post; the multi-agent extension is that every span has an explicit agent_id and agent_version, and the case-level trace is the concatenation of every specialist's contribution rather than a single agent's loop.
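For illustration only, a span with those fields and a case-level rendering could look like this. The Span class and its field names other than agent_id and agent_version are assumptions, not the wire format from the tracing post.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    agent_id: str
    agent_version: str
    model_version: str
    prompt_version: str
    input_summary: str
    output_summary: str
    latency_ms: int
    cost_usd: float


def case_story(case_id: str, spans: List[Span]) -> str:
    """Concatenate every specialist's contribution into one readable line."""
    steps = ", ".join(
        f"{s.agent_id}@{s.agent_version} {s.output_summary}" for s in spans
    )
    return f"case {case_id}: {steps}"


def total_latency_ms(spans: List[Span]) -> int:
    return sum(s.latency_ms for s in spans)
```

Because every span names its agent and version, the same list answers both the narrative question (what happened to this case) and the regression question (which specialist version was running when it happened).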

Multi-agent systems do not fail on intelligence. They fail on coordination, and coordination is engineering work, not prompt work.

What we ship at Cogneris

Cogneris was built as a multi-agent system from day one. The pattern we ship is hierarchical with a blackboard-style audit substrate underneath: each case has a single maestro governed by the tenant's policy, but every specialist's contribution lands as an artifact on a per-case shared state that the audit trail reads directly. The five specialists every customer gets out of the box are familiar by now: a classifier, an extractor, a validator, an escalator and an auditor. New tenants pick up the same set; new use cases add specialists, not prompt clauses.

The choices that show up in the architecture are the ones we have defended in earlier posts and that the multi-agent shape made easier to deliver, not harder:

Per-tenant maestro policy — the maestro's routing rules, dispatch budget and escalation thresholds are tenant-scoped, so one customer's claims pipeline can be tuned for speed and another's for conservatism without forking the platform.
Per-specialist model choice — the classifier runs on a small model, the extractor on a frontier VLM when the page demands it, the validator on rules with a small LLM as fallback. The VLM-first collapse we wrote about lives inside the extractor specialist; the rest of the squad does not pay the latency cost of a frontier call.
Case-level audit trail — each specialist's input, output, model, prompt and cost is on the case record, retrievable per case for the retention window the tenant configures. The same artifact answers the non-deterministic audit question and the regulator's evidence request.
Budget envelope per case — the maestro tracks tokens and dollars per case, refuses to dispatch a specialist that would push the case over the cap, and routes the case to human review by design when the cap is reached. The platform does not silently overrun the unit economics.
Kill switch per specialist — every specialist has a feature flag that takes it out of the dispatch pool. If the validator's false-positive rate spikes after a model change, the tenant's operator pulls that specialist out of the loop and the maestro falls back to a human review step until the regression is resolved.
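The kill switch in the last item can be sketched generically. This is not the Cogneris implementation: dispatch, the enabled flag map and the toy validator are hypothetical, and the flag store is a plain dict where a real platform would use a feature-flag service.

```python
from typing import Callable, Dict


def dispatch(
    name: str,
    payload: dict,
    specialists: Dict[str, Callable[[dict], dict]],
    enabled: Dict[str, bool],
) -> dict:
    """Run the specialist if its flag is on; otherwise fall back to human review."""
    if not enabled.get(name, False):  # a missing flag counts as disabled
        return {"handled_by": "human_review", "reason": f"{name} disabled"}
    return {"handled_by": name, "result": specialists[name](payload)}


def toy_validator(payload: dict) -> dict:
    return {"valid": payload["total"] > 0}
```

Flipping the flag changes routing on the next dispatch, so no redeploy is needed to take a regressed specialist out of the loop.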

Where we still do not use multi-agent

Honest list of the workflows where we deliberately keep one agent — not because we cannot do multi-agent, but because the simpler shape is the right one.

Single-document, single-decision flows. A signed consent form that needs a yes/no extraction does not benefit from a squad. One agent, one prompt, one structured output, done. Splitting it into classifier + extractor + validator adds latency and cost the workflow cannot recover.

Sub-100ms-per-page real-time paths. Embedded authorisation, live KYC, fraud-score-on-swipe — the round-trip overhead of multi-agent coordination is incompatible with the latency budget. The single-agent (or no-agent, deterministic) path stays in those flows; the agentic squad runs offline as a quality sampler.

Pilot phase on a new use case. The first 200 cases of a new workflow run on one agent, one prompt, one schema. We add a specialist when the data justifies it — a recurring failure mode the single agent cannot absorb without bloating the prompt — not when the proposal slide says "agentic." The pilot is for learning the workflow; the multi-agent split is for scaling it.

Closing thought

Multi-agent systems are not a product feature; they are an engineering shape. The teams that ship them well treat coordination as first-class infrastructure — message envelopes, budget envelopes, hop counters, kill switches, per-specialist evaluation, case-level audit — and the teams that struggle treat them as a prompt-engineering exercise and discover the coordination problems in production. The 35–55% throughput gain over a single agent is real; so is the 3–4x cost runaway if the envelope is not enforced. The choice between the two is engineering, not modelling, and that is the choice Cogneris was built around.

If you are evaluating where multi-agent makes sense in your own backoffice — claims, due diligence, reconciliations, onboarding — see our product page for the specialists we ship out of the box, or talk to our team and we will walk you through the routing pattern your workflow actually needs.

One agent is not enough anymore.