What "execution" actually adds to extraction

The old shape of the work was clean and well understood. A pipeline received a document, ran OCR and a classifier, extracted fields against a schema, and emitted a payload. The payload landed in a queue, a workflow engine, or an operator's screen, and a different team owned what came next. The split was useful: IDP vendors competed on accuracy, BPM vendors competed on orchestration, and the integration between them was a SOAP contract or a webhook with a JSON body.

What changed isn't the accuracy of extraction. We already covered why OCR moved from 70% to 98% in 2026 and where the pipeline absorbed the upgrade. What changed is that the agent now has hands. The same call that returns the fields can also: open the ticket, post the entry to the ledger, send the request-for-information email, place the hold, schedule the payment, fire the webhook to the CRM, and write the audit-trail entry that explains why. Execution isn't a new step in the pipeline — it's a new responsibility the pipeline now carries.

The three building blocks that made this practical in 2026:

  • Tool-use APIs that hold up under load — function calling stopped being a demo and started being a reliability surface. The same model that extracts the fields can call into your systems through a typed tool catalog with retries, timeouts, and idempotency keys baked into the contract.
  • Reasoning over multi-document state — workflows live in packets, not pages. PO, invoice, GRN; referral, clinical note, benefits summary; KYC ID, proof of address, source-of-funds letter. The model holds the packet, runs the policy, and emits a decision with the evidence chain attached. We walked through that shift in the reasoning-models piece — execution is what the reasoning step gets pointed at.
  • An audit trail that survives the loop — when the agent fires N actions, each one needs a provenance record that an auditor can replay months later without rerunning the model. We use the same schema we wrote about for non-deterministic outputs, extended so every tool call gets its own row.
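
To make the third building block concrete, here is a minimal sketch of one audit row per tool call. The field names, the `record_tool_call` helper, and the digest are illustrative assumptions, not the schema referenced above.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ToolCallRecord:
    """One audit row per tool call (all field names are hypothetical)."""
    case_id: str
    tool_name: str
    arguments: dict
    result: dict
    evidence_refs: tuple   # e.g. document IDs and fields the decision cited
    model_version: str
    prompt_version: str
    recorded_at: str


def record_tool_call(case_id, tool_name, arguments, result, evidence_refs,
                     model_version, prompt_version):
    """Build the row at write time so a later replay never needs the model."""
    return ToolCallRecord(
        case_id=case_id,
        tool_name=tool_name,
        arguments=arguments,
        result=result,
        evidence_refs=tuple(evidence_refs),
        model_version=model_version,
        prompt_version=prompt_version,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )


def row_digest(record):
    """Stable hash of the row, one way to detect edits made after the fact."""
    payload = json.dumps(asdict(record), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the row stores the arguments, the result, and the evidence references, an auditor can read back what happened without calling the model again.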

None of this is exotic in 2026. What's exotic is the operational discipline to wire all three together in a way that survives a real workload — and that's the part the slide tends to skip.

Three workflows, rewritten

The clearest way to see the shift is to put a familiar workflow next to its execution-driven version. We'll take three: customer onboarding, insurance claims first-notice-of-loss, and accounts payable invoice processing. The "before" column is the 2024–2025 reference; the "after" column is what the same workflow looks like when IDP carries execution.

Onboarding: KYC packet to active account

| Stage | Extraction-only pipeline | Execution-driven pipeline |
| --- | --- | --- |
| Capture | Customer uploads ID, proof of address, source-of-funds letter. Fields extracted; payload sent to onboarding queue. | Same capture. Agent extracts, runs cross-document checks (name match, date consistency, address normalization), and tags the packet's confidence shape per field. |
| Policy | Operator opens the case, runs the playbook manually, requests missing items by email if needed. | Agent runs the policy: if confidence is high and no flags, opens the account through the core-banking tool; if anything is borderline, fires a templated request-for-information email; if there is any sanctions-list hit, escalates with the evidence packet. |
| Decision | Operator approves, denies, or requests more info, usually within 1 business day for clean cases. | Agent decides in minutes for the 70–80% of clean cases, with the operator owning the residual queue that's actually ambiguous. |
| Audit | Operator notes in the case. Reconstructed at audit time from emails and tickets. | Every tool call, every model decision, every evidence reference recorded in the audit trail at write time. |
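
The Policy row of the onboarding table can be sketched as a small decision function. This is a sketch under stated assumptions: the `PacketAssessment` shape, the action names, and the 0.95 threshold are all hypothetical, and treating every non-clean packet as a request for information is one reading of "borderline."

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    OPEN_ACCOUNT = "open_account"
    REQUEST_INFO = "request_information"
    ESCALATE = "escalate"


@dataclass
class PacketAssessment:
    min_field_confidence: float   # lowest per-field confidence in the packet
    cross_check_flags: list       # e.g. ["name_mismatch"]
    sanctions_hit: bool


# Hypothetical threshold; a real program would tune it per field and risk class.
HIGH_CONFIDENCE = 0.95


def decide_onboarding(packet: PacketAssessment) -> Action:
    # A sanctions-list hit escalates first, regardless of confidence.
    if packet.sanctions_hit:
        return Action.ESCALATE
    if packet.min_field_confidence >= HIGH_CONFIDENCE and not packet.cross_check_flags:
        return Action.OPEN_ACCOUNT
    # Anything borderline gets a request for information rather than a guess.
    return Action.REQUEST_INFO
```

Checking the escalation branch first means a high-confidence packet can never bypass the sanctions check.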

Claims FNOL: first notice to triage decision

| Stage | Extraction-only pipeline | Execution-driven pipeline |
| --- | --- | --- |
| Intake | Email, photo, repair quote arrive in the FNOL inbox. IDP extracts policy number, loss date, description; payload to claims queue. | Same intake. Agent extracts, confirms the policy is in force, runs coverage check, computes a tentative reserve based on the loss description. |
| Routing | Triage operator reads the case, decides whether it's straight-through, complex, or fraud-suspect. | Agent applies the segmentation policy and routes: low-value clean cases to straight-through processing, complex cases to a specialist queue with a pre-built brief, fraud-suspect cases to SIU with the flags attached. |
| Customer | Acknowledgment email sent by the workflow engine, typically same day. | Acknowledgment, next steps, and any document requests sent in minutes, parameterized by the case shape. |
| Operator load | Triage operators handle 100% of incoming FNOLs. | Triage operators handle the 20–30% the agent couldn't decide; the same operators move into the role of validating the agent's harder calls instead of doing intake on the easy ones. |

Accounts payable: invoice arrival to payment

| Stage | Extraction-only pipeline | Execution-driven pipeline |
| --- | --- | --- |
| Receive | Invoice arrives by email or portal. IDP extracts header and line items; payload to AP workflow. | Same. Agent extracts and immediately pulls the matching PO and GRN through the ERP tool, runs 3-way match. |
| Exceptions | 3-way-match mismatches land in an exceptions queue for an AP clerk to chase. | Agent classifies the mismatch (quantity, price, missing GRN, tax mismatch), and either auto-resolves the resolvable ones (e.g., approved tolerance), opens a templated vendor email for missing GRN, or escalates with full context for the irresolvable ones. |
| Approval | Routed by amount and cost center to the right approver; chasing approvals is most of the cycle time. | Agent posts the entry, requests approval via the configured channel, and follows up automatically based on the SLA, with no human chasing a button click. |
| Pay | Treasury batches approved invoices for payment runs. | Agent schedules the payment within the configured payment terms and writes the audit-trail entry that ties extraction, match, approval, and payment together as one chain. |
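
The Exceptions row of the accounts payable table describes a classification step that is easy to sketch. The following is illustrative only: the `Line` shape, the class labels, and the 1% price tolerance are assumptions, and a real match runs per line item against ERP data.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Line:
    quantity: Decimal
    unit_price: Decimal   # assumed positive
    tax: Decimal


# Hypothetical approved tolerance: 1% on unit price.
PRICE_TOLERANCE = Decimal("0.01")


def classify_match(invoice: Line, po: Line, grn_quantity: Optional[Decimal]) -> str:
    """Return 'match', 'auto_resolve', or a named mismatch class."""
    if grn_quantity is None:
        return "missing_grn"          # candidate for a templated vendor email
    if invoice.quantity != grn_quantity or invoice.quantity != po.quantity:
        return "quantity_mismatch"    # escalate with full context
    if invoice.tax != po.tax:
        return "tax_mismatch"
    if invoice.unit_price == po.unit_price:
        return "match"
    # Relative price drift against the PO price decides auto-resolution.
    drift = abs(invoice.unit_price - po.unit_price) / po.unit_price
    return "auto_resolve" if drift <= PRICE_TOLERANCE else "price_mismatch"
```

Returning a named class rather than a boolean lets the router pick a different follow-up for each kind of mismatch.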

The pattern across all three is the same: extraction stayed roughly where it was; the rest of the workflow moved into the pipeline. The operator count dropped, not because the agent does the operator's job, but because it does the parts of the job that weren't worth a human in the first place.

What changes in the architecture

The slide says "agent." The architecture says four uncomfortable things.

The pipeline now owns failure. When IDP returned a payload, a failure was someone else's problem — the workflow engine retried, the operator filled in the gap, the integration team owned the contract. When IDP fires the tool calls, the pipeline inherits the responsibility for what happens when the ERP times out, when the email bounces, when the policy doesn't cover the case the document describes. Designing the agent's tool surface like an internal API — typed contracts, idempotency keys, structured errors — is no longer a nice-to-have; it's where most of the operational work moves.

Reversibility becomes a first-class property. An extraction that returns the wrong customer ID is recovered by editing a field. An action that opens an account, posts a journal entry, or sends a payment instruction is recovered by an entirely different process. The architecture has to know which actions are reversible cheaply, which require an approval gate, and which earn a human in the loop regardless of model confidence. Most programs that stall in production stalled here.
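
One way to encode that knowledge is a reversibility tag per tool plus a gate in front of every call. This is a minimal sketch; the tool names, the tags assigned to them, and the 0.9 confidence floor are hypothetical.

```python
from enum import Enum


class Reversibility(Enum):
    CHEAP = "cheaply_reversible"
    EXPENSIVE = "expensively_reversible"
    IRREVERSIBLE = "irreversible"


# Hypothetical tool names and tags.
REVERSIBILITY_MAP = {
    "send_rfi_email": Reversibility.CHEAP,
    "post_journal_entry": Reversibility.EXPENSIVE,
    "send_payment_instruction": Reversibility.IRREVERSIBLE,
}


class ApprovalRequired(Exception):
    """Raised when a call must wait for a human decision."""


def authorize(tool_name: str, confidence: float, human_approved: bool = False,
              min_confidence: float = 0.9) -> None:
    """Raise unless the call may proceed; return None when it may."""
    tag = REVERSIBILITY_MAP.get(tool_name)
    if tag is None:
        raise KeyError(f"{tool_name} is not in the reversibility map")
    if human_approved:
        return
    # Irreversible tools ignore model confidence entirely.
    if tag is Reversibility.IRREVERSIBLE:
        raise ApprovalRequired(f"{tool_name} needs human approval")
    # Expensive-to-reverse tools get an approval gate below the floor.
    if tag is Reversibility.EXPENSIVE and confidence < min_confidence:
        raise ApprovalRequired(f"{tool_name} needs approval at confidence {confidence}")
```

An untagged tool raises instead of defaulting to "allowed," so a new action cannot reach production without someone deciding how reversible it is.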

Policy moves out of the operator's head. The fast-model pipeline could ship while the policy was "what experienced operators do." The execution pipeline cannot. If the agent is to apply the policy, the policy has to be encoded somewhere the agent can read, reason against, and cite. That work — the work of writing down what was tribal — is the part of agentic IDP that consistently takes longer than the model integration. We've seen programs spend three weeks on the model and three months on the policy. The order is rarely the other way around.

Observability stops being optional. When the pipeline does one thing, you log the one thing. When the pipeline fires a chain of actions, every link in the chain is a place for the program to drift quietly. We treat the tracing layer as a first-order concern: every model call, every tool call, every decision branch is a span, and the trace is the single artifact that ops, engineering, and compliance all read.
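
The span-per-step idea can be sketched with nothing but the standard library. This is an illustration, not a tracing backend: the `span` helper, the in-memory `TRACE` list, and the span fields are assumptions.

```python
import time
import uuid
from contextlib import contextmanager

TRACE = []  # in-memory sink; a real system would export to a tracing backend


@contextmanager
def span(trace_id: str, kind: str, name: str, **attributes):
    """Record one span per model call, tool call, or decision branch."""
    entry = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex,
        "kind": kind,              # "model", "tool", or "decision"
        "name": name,
        "attributes": attributes,
        "status": "ok",
    }
    start = time.monotonic()
    try:
        yield entry
    except Exception as exc:
        # Keep the failure on the span, then let the caller handle it.
        entry["status"] = f"error: {type(exc).__name__}"
        raise
    finally:
        entry["duration_s"] = time.monotonic() - start
        TRACE.append(entry)
```

Wrapping each step as `with span(trace_id, "tool", "erp.fetch_po", po_number="PO-1"): ...` records failed calls as well as successful ones, since the `finally` clause always appends the span.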

What breaks when IDP starts firing actions

Three failure shapes show up early. None of them are fatal; all of them are surprising the first time.

Silent miswires

The agent fires the right tool with subtly wrong arguments. The vendor master record has two records for the same legal entity at different addresses; the agent picks the wrong one; the payment goes to the right vendor at the wrong bank account. The model isn't confused; the data was ambiguous and the agent committed without surfacing the ambiguity. The fix is a tagging pass that explicitly flags resolution ambiguity and routes those cases to a human, rather than relying on the model's confidence score for the extraction itself.
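
A tagging pass of that kind can be as small as a resolver that refuses to pick among duplicates. The `VendorRecord` shape and the exact-name matching rule below are simplifying assumptions; real vendor matching is fuzzier.

```python
from dataclasses import dataclass


@dataclass
class VendorRecord:
    vendor_id: str
    legal_name: str
    bank_account: str


def resolve_vendor(legal_name: str, vendor_master: list) -> dict:
    """Resolve an extracted vendor name, surfacing ambiguity instead of guessing."""
    wanted = legal_name.strip().casefold()
    candidates = [v for v in vendor_master
                  if v.legal_name.strip().casefold() == wanted]
    if len(candidates) == 1:
        return {"status": "resolved", "vendor": candidates[0]}
    if not candidates:
        return {"status": "not_found", "candidates": []}
    # More than one record: extraction confidence says nothing about which
    # one is right, so the case goes to a human with all candidates attached.
    return {"status": "ambiguous", "candidates": candidates}
```

The status is computed from the master data, not from the model, so a perfectly extracted name can still come back as ambiguous.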

Partial transactions

The agent posts the journal entry, then the approval tool times out, then the retry loop runs again and posts a duplicate entry. The model didn't fail; the tool surface did, and the pipeline didn't enforce idempotency at the right boundary. The fix is operational hygiene that AP teams have done for decades — idempotency keys, transaction IDs, write-once semantics on the tools that move money — applied to the agent's tool catalog instead of the workflow engine's job table.
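
The duplicate-posting case is easy to reproduce and to prevent in a toy setting. In this sketch the `Ledger` class stands in for a real posting tool, and deriving the key from the case, tool, and arguments is one possible convention, not a prescribed one.

```python
import hashlib
import json


class Ledger:
    """Toy ledger tool with write-once semantics per idempotency key."""

    def __init__(self):
        self._entries = {}

    def post_entry(self, idempotency_key: str, entry: dict) -> dict:
        # A repeated key returns the original result instead of posting again.
        if idempotency_key in self._entries:
            return self._entries[idempotency_key]
        result = {"entry_id": len(self._entries) + 1, **entry}
        self._entries[idempotency_key] = result
        return result

    def count(self) -> int:
        return len(self._entries)


def idempotency_key(case_id: str, tool_name: str, arguments: dict) -> str:
    """Derive the key from the intent, so every retry of a step reuses it."""
    payload = json.dumps([case_id, tool_name, arguments], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

If the retry loop reruns the whole step after the approval tool times out, the second `post_entry` call with the same key returns the first entry and the ledger still holds one row.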

Escalation drift

The escalation queue, two months in, holds the cases the agent decided not to decide. The operators handle them. Nobody looks at the rate at which those cases pile up, or at whether the agent's threshold for escalation has drifted with each model update. The program reports "85% straight-through" and is technically right, but the residual is growing faster than the volume, which means the operators are quietly absorbing the drift. The fix is a weekly review of the escalation rate by case class, not by aggregate, because the aggregate hides the regressions that matter.
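
The weekly review reduces to a per-class rate and a comparison against the previous week. The function names and the five-point tolerance here are illustrative assumptions.

```python
from collections import Counter


def escalation_rates(cases):
    """cases: iterable of (case_class, escalated) pairs -> rate per class."""
    totals, escalated = Counter(), Counter()
    for case_class, was_escalated in cases:
        totals[case_class] += 1
        if was_escalated:
            escalated[case_class] += 1
    return {c: escalated[c] / totals[c] for c in totals}


def regressions(last_week, this_week, tolerance=0.05):
    """Classes whose escalation rate rose by more than `tolerance`.

    Classes with no baseline last week are not flagged.
    """
    return sorted(
        c for c, rate in this_week.items()
        if rate - last_week.get(c, rate) > tolerance
    )
```

A shift in case mix can improve the aggregate rate while one class gets much worse, which is why the comparison runs per class.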

What to put in place before flipping the switch

The shortlist for going from extraction to execution without breaking the workflow you're replacing.

  • A typed tool catalog with explicit semantics — every action the agent can take is a named tool with a typed input, a typed output, an idempotency contract, and a written description of when to use it. The agent doesn't reach into general-purpose endpoints; it calls the catalog. The catalog is reviewed like an API.
  • A written policy the agent reasons against — not a wiki, not a handbook. A structured policy with named branches, explicit exceptions, and a designated "escalate" exit. The translation work is real; the agent's quality is bounded by the quality of the policy more than by the model.
  • A reversibility map per action — every tool call is tagged as cheaply-reversible, expensively-reversible, or irreversible. The router refuses to call irreversible tools without human approval, regardless of confidence, until the program has earned the trust to remove the gate.
  • An audit trail that records the chain, not just the answer — the extraction, the policy decision, every tool call, the evidence cited, the model version, the prompt version. We use the same schema across all of it, so the auditor isn't stitching artifacts from three systems six months after the fact.
  • A rollback to the prior workflow — the program needs a flag that routes a class of cases back to the extraction-only pipeline with operator handling. Cheap to build, easy to forget, expensive to be without when you need it. The day a provider ships a regression and the agent starts auto-approving the wrong thing, the flag is what stops the bleeding inside an hour instead of inside a day.
  • An evaluation set that scores actions, not just extractions — the old eval set scored fields. The new eval set scores "did the agent take the right action on this case." That's a different dataset, a different label, and a different review cadence. Programs that ship without it discover that field-level accuracy and action-level correctness are not the same number.

Execution-driven IDP isn't extraction with a bigger surface area. It's a pipeline that takes responsibility for the workflow it used to feed.
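
The last item on the shortlist, scoring actions as well as fields, can be illustrated with a small scorer. The case dictionary layout and the metric names are assumptions made for this sketch.

```python
def score_eval_set(cases):
    """Score field-level accuracy and action-level correctness separately.

    Each case is a dict with `expected_fields`, `predicted_fields`,
    `expected_action`, and `predicted_action` (hypothetical keys).
    """
    field_hits = field_total = action_hits = 0
    for case in cases:
        for name, expected in case["expected_fields"].items():
            field_total += 1
            field_hits += case["predicted_fields"].get(name) == expected
        action_hits += case["predicted_action"] == case["expected_action"]
    return {
        "field_accuracy": field_hits / field_total,
        "action_accuracy": action_hits / len(cases),
    }
```

A case where every field is extracted correctly but the agent opens an account it should have escalated counts fully toward field accuracy and not at all toward action accuracy, which is the gap the shortlist item describes.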

Where the win actually comes from

The temptation when you read the slide is to compute the savings as "operator hours avoided." That number is real, but it's not what makes the business case hold up. Three compounding effects tend to dwarf it in practice.

Cycle time collapses, and revenue follows. An onboarding workflow that ran in 1 business day now runs in minutes. The conversion rate at the top of the funnel goes up because the abandonment that lived in the 24-hour gap stops happening. The operations savings are real; the conversion lift is usually bigger.

Variance drops. Operator-driven workflows are bimodal — fast on a good day, slow on a bad one, slower at month-end, slower still when someone is on leave. The agent doesn't care what day it is. For workflows where customer experience is driven by the worst case, not the average, this is the part the CFO eventually notices on churn.

Exception handling gets better, not worse. Counterintuitive, but the operators who used to triage everything now spend their time on the cases that earned their attention. The hard cases get more time, not less. We've watched programs where the operator satisfaction scores went up after automation, because the work that remained was the work people actually wanted to do.

Where it doesn't fit

Three places we'd push back if a customer asked us to wire execution into the pipeline on day one.

Workflows where the policy genuinely changes case by case. If the decision is a judgment call that depends on context the document doesn't carry, the model is going to either invent a policy or escalate everything. Both outcomes are worse than leaving the operator in place. The right move there is to give the operator a better brief, not to take them out of the loop.

Workflows with no reversible failure mode. Some actions cannot be undone without breaking the workflow they're embedded in: wire transfers above a threshold, regulated filings, terminal customer communications. The execution layer can prepare the artifact and stage it; the human still presses the button. Programs that try to remove that human invariably pay for it once and then put the human back.

Workflows nobody has actually mapped. "Map the workflow" sounds like consultant-speak until you try to encode the policy and realize three teams disagree on what the rules actually are. The right first project is the one where the policy is already written down somewhere defensible — not the one where the automation will force a policy conversation that needed to happen anyway. Doing both at once is how pilots stall.

Closing thought

The framing we keep coming back to is that document AI in 2026 is not the part of the pipeline that reads the page. It's the part of the pipeline that owns what the page triggers. The extraction got good enough that nobody competes on it anymore; the execution is where the operating leverage lives, and the engineering — tool catalogs, encoded policy, reversibility maps, audit trails that survive an agent — is where the programs that ship separate from the programs that demo well.

If the pitch from the vendor still ends at "we extract the fields," the pitch is from 2024. If the pitch ends at "we extract the fields and then the workflow engine takes over," the pitch is from 2025. The 2026 pitch, the one worth taking seriously, ends at "we extract, we decide, we act, and the audit trail explains every step the same way every time." That's a different shape of product, and it's what the next round of document AI programs is being judged against.

For the reference architecture Cogneris runs — perception, validation, reasoning, action, with the audit trail and policy encoding wired in by default — see our product page or talk to our team. We're happy to walk through where execution belongs in your workflow and where leaving the operator in the loop is still the right call.