The shape of the old contract

The classic IDP architecture was a four-stage pipeline that hadn't materially changed in a decade: classify the document, OCR the pages, extract a fixed set of fields, hand the JSON to a downstream system. Everything past that JSON belonged to RPA, to a workflow engine, or to a human operator who had to interpret the fields against a policy document.

The architecture was an artifact of what the technology could actually do. Classification was a model that returned one of N labels. Extraction was a model that returned values for named slots. Neither model knew anything about the business rule those values would later be evaluated against. The "intelligence" stopped at the page boundary.

That split was reasonable when extraction was the hard part. It stopped being reasonable when extraction got cheap.

What broke the contract

Three things compounded over the last 24 months. None of them is dramatic on its own. The combination is.

Multimodal LLMs absorbed the classification and extraction stages

A 2024-vintage IDP pipeline had separate models for layout analysis, OCR, classification, and field extraction. A 2026-vintage pipeline has one multimodal model that consumes the page image and produces structured fields directly, with the classification implicit in the schema it picks. Three pipeline stages collapsed into one model call. The accuracy on printed-text extraction crossed 98% on standard benchmarks, and on handwritten content crossed 95% — numbers the OCR-plus-NLP stack used to bottom out at 70%.

The headline metric is accuracy. The architectural consequence is more interesting: once a single model is doing the perception work, the question of "what should we do with the result" can run inside the same context. The cost of asking the next question dropped to roughly the cost of one more turn.

The deterministic floor stayed deterministic

The temptation in 2024 was to throw the entire pipeline at an LLM. That worked in demos and broke in production — high-volume invoices, structured forms, batch claims processing all have layouts where a deterministic OCR plus a parser is faster, cheaper, and more predictable than a multimodal call. The teams that survived the transition didn't replace deterministic OCR. They put it underneath.

The hybrid pipeline that won looks like this: deterministic OCR plus parser handles the 80% of pages with predictable structure, fast and cheap. The multimodal model handles the long tail — handwriting, complex tables, novel layouts, exception cases — where the determinism doesn't pay back. The router that decides which path to take is itself part of the system that needs auditing.

Schemas grew a "what next" column

The third change is the quietest. The schema that an extraction pipeline produces stopped being a flat map of field names. It grew structured intent: confidence per field, cross-field consistency flags, suggested next actions, evidence pointers back to source regions. The schema started carrying enough metadata for the downstream system to know not just what was extracted, but how much to trust it and what would normally happen to a document that looked like this.

Once that metadata exists, the question of whether a separate "decision" stage even belongs in the pipeline becomes architectural rather than philosophical.

The new reference architecture

The architecture that's become consensus, with rough variations across vendors, has four layers. The first two are recognizably IDP. The last two are new.

Layer What it does What changed in 2026
Perception Turn pages into structured candidate fields with confidence and provenance. Hybrid OCR + multimodal LLM, routed by document type. Single model call for the long tail.
Validation Apply schema, cross-field, and business rules. Surface what looks wrong. Validation became a first-class stage with its own observability — most successful injection and drift incidents show up here first.
Reasoning Frame the validated extraction against policy. Decide what should happen. New stage. Either a tightly-scoped second model or a deterministic rules layer; sometimes both, with the model proposing and the rules adjudicating.
Action Execute the decision: approve, route, request more, escalate. Tools and integrations the agent can call. The blast radius scales with what's wired in here.

The split between perception and reasoning isn't decorative. It's the same privilege- separation argument we made for prompt injection defenses: the model that reads the document is not the model that decides what to do with the result. That split exists for security reasons. It also exists because debugging a system that conflates the two is miserable.

The pipeline doesn't replace the operator. It replaces the part of the operator's day that was just typing.

What this looks like end-to-end

Take a worked example: a workers' compensation claim packet. Twenty-eight pages, a mix of the carrier's intake form, a clinic's medical report, two scanned receipts, and a one-page employer attestation. In the old contract, IDP extracted maybe 35 fields and handed them to a claims operator who spent forty minutes deciding whether to approve, request more documentation, or route to a senior adjuster.

In the new contract:

  • Perception classifies each page, extracts fields with confidence scores, and returns a structured packet. The intake form goes through deterministic parsing. The medical report goes through the multimodal model because the diagnosis codes appear in handwritten margins. The receipts go through OCR plus a layout heuristic.
  • Validation notices that the date of injury on the intake form is eleven days before the date on the medical report — flagged as a temporal anomaly, not as a hard error. Validation also notices the employer attestation is missing a signature block — flagged as a document-completeness gap.
  • Reasoning consults policy: temporal gaps under 14 days are tolerated if the medical report includes a delayed-presentation note (it does); missing employer signatures auto-trigger a request-for-information instead of a rejection. The reasoning layer proposes "approve with conditions, send RFI for employer signature" and explains which clauses of the policy it relied on.
  • Action executes: opens the claim in approved-with-conditions state, sends the templated RFI email to the employer's HR contact, posts an audit entry with full provenance back to the source pages.

The operator doesn't disappear. They review the cases the reasoning layer flagged for review, they handle the cases policy explicitly excluded from automation, and they spot- audit the rest. Their throughput moves by an order of magnitude. Their job stops being typing and starts being judgment.

The parts that are quietly harder

The architecture diagram is clean. The implementation has four places where the work is more than the slide suggests.

Policy is not a document, it's a constellation

The reasoning layer needs the rules expressed in something the system can evaluate. In most enterprises, "the policy" lives across a 200-page handbook, a dozen exception memos, a Slack channel where the senior adjusters resolve edge cases by precedent, and a handful of unwritten conventions. Turning that into something a reasoning layer can apply consistently is a months-long encoding project, not a config file.

The teams that get this wrong try to feed the handbook directly into a model and hope for the best. The teams that get it right treat the policy encoding as the same kind of artifact as code — versioned, reviewed, tested against historical cases, owned by a named person.

Validation is the most leveraged stage and the least glamorous

Skipping straight from perception to action is a common mistake in early agentic deployments. It works in the demo. It fails in production the first time the multimodal model returns a confident but wrong field, because nothing between the model and the action layer noticed.

Validation is where most of the resilience lives. Cross-field consistency, schema adherence, range checks, business-rule evaluation, surprise detection on confidence distributions — none of it is novel computer science. All of it is the difference between a pipeline that automates 35% of cases reliably and one that automates 70% of cases with a 4% silent error rate that nobody catches until quarter-end.

The audit trail has to survive the model swap

Models will change. The model you deployed in March will not be the model you deploy in September; the provider will deprecate it, or a better one will ship, or pricing will move. Every decision your system made under model A needs to be reproducible-enough that an auditor in 2027 can answer "why was this claim approved" without re-running the model.

That's a logging discipline, not a feature. We've written about the audit-trail schema we run at Cogneris — the short version is that the prompt, the model identifier, the perception output, the validation result, and the reasoning rationale all have to be captured at the moment of the decision, not reconstructed after. Pipelines that don't do this from day one rebuild them under regulator pressure later.

The cost curve isn't flat

A pure-LLM pipeline that ran $0.40 per packet in a 2024 proof-of-concept can run $0.04 per packet in a 2026 production system, but only if the routing actually works. The teams we see hit the high-cost end of that range are the ones that send every page through the most capable multimodal model "to be safe." The teams that hit the low-cost end run a small classifier first, route 80% of traffic through cheaper paths, and reserve the expensive model for the long tail. Same accuracy. Tenth of the bill.

What this means for ops

The job description for a backoffice operations leader running an agentic IDP system is not the job description from 2024. The day-to-day shifts in three concrete ways.

  • The exception queue replaces the work queue. Operators stop processing cases linearly and start triaging the cases the system flagged. The queue is shorter and stranger. The training material has to change.
  • Policy ownership becomes a real role. Someone has to maintain the encoded policy as the underlying business rules change. In most teams, that person didn't exist 18 months ago. They report to ops, not to engineering, because they're translating business intent into something the system applies.
  • Quality assurance shifts from spot-checking output to monitoring distributions. The right question moves from "is this individual case correct" to "is the rate of approve-with-conditions consistent with last quarter, and if not, why." The dashboards look more like an SRE's than an auditor's.

The honest take on adoption

The 67% number in the survey is a measure of evaluation, not deployment. The deployment number is meaningfully smaller — somewhere in the 20–25% range across enterprise IDP programs by our read of the same data set. The gap between evaluation and production is not a technology gap. It is the gap between "we tried it on twenty documents" and "we have an audit trail, a policy owner, a validation discipline, a model-swap process, and a retraining loop."

The technology cleared the bar for production a year ago. The organizational scaffolding is what's catching up now. Pipelines that get into production in 2026 do so because the team owning them treated the move from "extraction" to "decision" as an operating-model change, not as a model upgrade.

Closing thought

The interesting thing about the old IDP contract is that it was designed around a capability boundary that no longer exists. Once a single model can read the page, apply the schema, and reason about what the schema means, splitting the work back into "extraction" and "everything after extraction" is a choice — usually a defensible one, sometimes not. The teams worth watching in 2026 are the ones treating that split as a deliberate architectural decision rather than an inherited one.

For the Cogneris reference pipeline — perception, validation, reasoning, action — and the audit trail we ship by default, see our product page or talk to our team. We're happy to walk through specific document types and where the boundary should sit for yours.