Template vs zero-shot extraction

Two definitions, stated precisely

The argument gets muddy because people use both terms loosely, so start with definitions tight enough to reason about.

Template-based extraction is any approach that relies on prior knowledge of the document's structure. In its oldest form that is zonal OCR — read the text inside fixed coordinates on a page you know the shape of. In its modern form it is a per-class configuration: a profile that says "this is a Vendor-X invoice, the total lives in this region or under this label, parse it this way" — possibly with anchors and relative positioning so it survives small shifts. The defining property is not the coordinates; it is that the system was told, ahead of time, what this document looks like. Everything good and bad about templates follows from that one fact.
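To make "a per-class configuration" concrete, here is a minimal sketch of a label-anchored profile. It is an illustration under stated assumptions, not any particular product's format: the names `FieldRule`, `TemplateProfile` and `extract_with_template` are hypothetical, and it works on plain OCR text rather than page coordinates to stay short.

```python
import re
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

# Hypothetical profile format; these names are illustrative, not a real library.
@dataclass(frozen=True)
class FieldRule:
    name: str           # schema field to fill
    label_pattern: str  # regex for the anchor label printed on the form

@dataclass(frozen=True)
class TemplateProfile:
    doc_class: str
    rules: tuple[FieldRule, ...]

# An amount such as 1,234.50 (assumes comma grouping and two decimal places).
AMOUNT = r"([0-9][0-9,]*\.[0-9]{2})"

def extract_with_template(profile: TemplateProfile, text: str) -> dict[str, Optional[Decimal]]:
    """Read each field from the text that follows its anchor label.

    A rule whose label is not found yields None, which is easy to detect.
    """
    out: dict[str, Optional[Decimal]] = {}
    for rule in profile.rules:
        match = re.search(rule.label_pattern + r"\s*:?\s*\$?" + AMOUNT, text)
        out[rule.name] = Decimal(match.group(1).replace(",", "")) if match else None
    return out

vendor_x = TemplateProfile("vendor_x_invoice", (FieldRule("total", r"Amount Due"),))
known_layout = "Invoice 1001\nAmount Due: $1,234.50\n"
relabelled = "Invoice 1002\nBalance payable: $1,234.50\n"

print(extract_with_template(vendor_x, known_layout))
print(extract_with_template(vendor_x, relabelled))
```

The second document carries the same total under a different label, and the profile returns nothing for it. The profile only knows what it was told ahead of time.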

Zero-shot extraction is the opposite stance: the model is given a document it has never been configured for and a target schema — the fields you want and their types — and asked to populate the schema directly. No template, no per-layout training, no labelled examples of that vendor. Modern vision-language models do this well because they read layout and language together, so "the number to the right of the words Amount Due, even though I have never seen this form" is a question they can answer. The defining property is that the system was told nothing about this specific document's shape — only what you want out of it.

One clarification that saves a lot of confusion: zero-shot is not the same as schema-less. A good zero-shot pipeline is heavily constrained by a typed schema, by field descriptions, and by validation rules; it just isn't constrained by layout. And "few-shot" sits between the poles: a handful of examples nudges the model without the full cost of a maintained template. The spectrum runs along a single axis, prior knowledge of the layout, from "fully specified" to "none", and the trade-offs move monotonically along it.
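As a sketch of what "constrained by schema, not by layout" can look like, the snippet below declares fields with types and descriptions and then coerces a model's raw string output into those types. The schema, the field names and the `validate` helper are assumptions made for illustration; nothing here refers to a specific model API.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation

@dataclass(frozen=True)
class FieldSpec:
    name: str
    type_: type
    description: str  # what the model is told the field means

# Hypothetical invoice schema: no coordinates, no vendor, only fields and types.
INVOICE_SCHEMA = (
    FieldSpec("invoice_number", str, "Identifier printed by the vendor"),
    FieldSpec("invoice_date", date, "Issue date in ISO format, not the due date"),
    FieldSpec("total", Decimal, "Final amount payable, including tax"),
)

def coerce(spec: FieldSpec, raw: str):
    """Convert one raw string into the schema type, or raise ValueError."""
    if spec.type_ is date:
        return date.fromisoformat(raw)  # raises ValueError on a non-ISO date
    if spec.type_ is Decimal:
        try:
            return Decimal(raw)
        except InvalidOperation as exc:
            raise ValueError(f"{spec.name}: not a decimal: {raw!r}") from exc
    return raw

def validate(raw_output: dict[str, str]) -> dict[str, object]:
    """Check that every schema field is present and well-typed."""
    missing = [s.name for s in INVOICE_SCHEMA if s.name not in raw_output]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {s.name: coerce(s, raw_output[s.name]) for s in INVOICE_SCHEMA}

print(validate({"invoice_number": "A-17", "invoice_date": "2024-03-01", "total": "1234.50"}))
```

A check like this catches malformed output only. It says nothing about whether a well-formed value is the right one, which is why the later sections lean on confidence and verification.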

A template encodes what you already know. Zero-shot handles what you don't. The mistake is using either one for the job the other was built for.

The eight dimensions a buyer actually weighs

Nobody chooses on accuracy alone. The real decision balances setup cost, maintenance, unit cost, determinism and a few things that only show up at scale. Here is the honest comparison.

Dimension	Template-based	Zero-shot (VLM)
Setup / cold-start	Per-class effort before the first document clears — define, test, tune. Onboarding a new vendor is a task.	None. A new format works on its first document, which is the whole point.
Accuracy on stable, high-volume layouts	Very high and very consistent — it is reading a known place on a known form.	High, but with variance the template doesn't have; the odd field drifts where a rule would not.
Accuracy on novel / long-tail layouts	Fails — there is no template, so there is no answer until someone builds one.	Its home turf: reads a format it has never seen at usable accuracy.
Cost per page	Near zero at run time — deterministic parsing is cheap.	A model call per document; real money at high volume, though falling fast.
Determinism	Same input, same output, every time — reproducible by construction.	Non-deterministic; needs pinned settings, validation and an audit trail to be reproducible.
Maintenance	Templates rot — a layout change silently breaks one, and you only learn downstream.	No per-template upkeep; maintenance moves to prompts, schema and the eval set.
Explainability	Trivial — the value came from a named region by a fixed rule.	Needs work — provenance and per-field confidence have to be engineered in, not assumed.
Failure shape	Fails loudly — no match, empty field, obvious miss.	Fails quietly — a plausible, well-typed, wrong value that passes a shape check.

Read the last row twice, because it is the one people underweight and it dominates risk. A template that breaks tends to return nothing, and nothing is easy to catch. A zero-shot model that errs tends to return something that looks right — the correct type, a sensible magnitude, a date that parses — and that is far more dangerous in a pipeline that auto-approves, because it sails through the cheap validations. The defence is not "trust the model less"; it is per-field confidence and a verification step, which we come back to.

Where template-based still wins

The category is not legacy, and anyone telling you VLMs retired it is selling something. Templates win decisively in four situations:

High volume, one stable layout — millions of the same form a month (a carrier's own claim form, a fixed government filing, an EDI-adjacent document). The setup cost amortises to nothing and the per-page cost advantage compounds into a real margin line.
Hard determinism requirements — when the same input must produce the same output for audit or regulatory reasons, and "the model is usually right" is not an acceptable sentence to write in a control.
Strictly positional fields — boxes on a tax form, checkbox grids, fixed-pitch tables where position is the meaning and a rule reads it more reliably than reasoning about it.
Cost-bound commodity extraction — extremely high volume against a thin margin, where even a small per-page model cost would invert the unit economics. The cost build-up decides it.

The common thread: a known, stable, high-volume document where the head of the distribution is fat. Template the head. The mistake is assuming the head is the whole distribution — it almost never is.
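The cost build-up behind "template the head" is simple arithmetic, sketched below. The model is an assumption: a template carries a one-off setup cost plus monthly upkeep, spread over the pages it handles, while zero-shot carries only a per-page model cost. Every figure in the example is made up for illustration.

```python
def fixed_cost(setup_cost: float, monthly_upkeep: float, months: int) -> float:
    """Total non-volume cost of owning one template over the period."""
    return setup_cost + monthly_upkeep * months

def template_cost_per_page(setup_cost: float, monthly_upkeep: float,
                           runtime_cost_per_page: float,
                           pages_per_month: float, months: int) -> float:
    """Amortised template cost per page at a given monthly volume."""
    total_pages = pages_per_month * months
    return fixed_cost(setup_cost, monthly_upkeep, months) / total_pages + runtime_cost_per_page

def break_even_pages_per_month(setup_cost: float, monthly_upkeep: float,
                               runtime_cost_per_page: float,
                               model_cost_per_page: float, months: int) -> float:
    """Monthly volume above which the template is cheaper than a model call."""
    saving_per_page = model_cost_per_page - runtime_cost_per_page
    if saving_per_page <= 0:
        return float("inf")  # the template never pays back
    return fixed_cost(setup_cost, monthly_upkeep, months) / (saving_per_page * months)

# Hypothetical figures: 1200 to build, 100 a month to maintain, 0.02 per model call.
volume = break_even_pages_per_month(1200.0, 100.0, 0.0, 0.02, months=12)
print(f"break-even volume: about {volume:,.0f} pages per month")
```

Below the break-even volume the layout belongs on the model; above it the template's fixed cost has amortised away.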

Where zero-shot wins

Zero-shot earns its cost wherever the per-template economics stop making sense, which covers most of the variety in real-world documents:

The long tail of layouts — hundreds or thousands of distinct formats, each at low volume, where you will never build and maintain a template per vendor. One model handles all of them, and the ones you have not seen yet.
Cold-start and fast onboarding — a new customer or vendor produces correct output on day one instead of waiting on a templating backlog. Time-to-first-value goes from weeks to the first document.
Semantic, non-positional fields — "the governing-law clause", "the renewal terms", "the diagnosis" — things defined by meaning, not coordinates, where a rule has nothing to anchor to and reasoning is the only path. This is the extraction-to-decision territory.
Messy, variable, human-authored input — documents that never had a fixed layout to begin with: emails, letters, scanned notes, photographed receipts at an angle.

The common thread here is the mirror image: variety over volume, meaning over position, and a tail too long and too thin to template economically. That tail is where the cost and the errors of a manual or template-only operation actually live.

The hybrid that ships

Almost no serious production system is purely one or the other. The architecture that survives contact with real document mix is a router, not a religion — and the design is the same shape we argue for elsewhere: classify first, then send each document to the cheapest method that clears it at the required accuracy.

Route by class and confidence, not by ideology

Identify the document, and if it matches a known high-volume class with a healthy template, parse it the cheap deterministic way. If it doesn't — new vendor, unknown format, low-volume oddity — fall through to zero-shot. The router reads the same two signals on every document: do I recognise this, and how sure am I. That is the whole control surface.
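The two-signal router described above can be sketched in a few lines. The classifier itself is out of scope here; `Classification`, `route`, the set of healthy templates and the 0.9 threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Path(Enum):
    TEMPLATE = "template"
    ZERO_SHOT = "zero_shot"

@dataclass(frozen=True)
class Classification:
    doc_class: Optional[str]  # None when the classifier recognises nothing
    confidence: float         # how sure the classifier is, from 0.0 to 1.0

def route(c: Classification, healthy_templates: set[str],
          min_confidence: float = 0.9) -> Path:
    """Send a document down the cheap path only when both signals agree."""
    recognised = c.doc_class is not None and c.doc_class in healthy_templates
    if recognised and c.confidence >= min_confidence:
        return Path.TEMPLATE
    return Path.ZERO_SHOT  # new vendor, unknown format, or an unsure match

templates = {"vendor_x_invoice"}
print(route(Classification("vendor_x_invoice", 0.97), templates))
print(route(Classification("vendor_x_invoice", 0.55), templates))
print(route(Classification(None, 0.0), templates))
```

Defaulting to zero-shot on any doubt is a design choice in this sketch: a wrong template match is worse than an unnecessary model call.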

Use zero-shot as the cold-start and the fallback

Zero-shot is what you run before a template exists and what catches everything a template misses, so coverage is never gated on the templating backlog. New formats are handled from the first document; the template, if it is ever worth building, comes later and only for volume that justifies it.

Treat templates as a cache of what zero-shot learned

A useful reframing: a template is a cache. When zero-shot extraction handles the same layout enough times with high confidence, the system can crystallise that into a cheap deterministic profile automatically, promoting a hot layout from the expensive path to the cheap one. You get zero-shot's coverage and template economics on the head of the distribution, and the human templating effort drops toward zero. The cold tail stays on the model; the hot head graduates to the cache.
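One way to decide when a layout is "hot" enough to promote is to count consecutive high-confidence zero-shot runs per layout. This is a sketch under assumptions: the consecutive-streak rule, the thresholds, and the idea that some upstream step assigns a stable `layout_id` are all choices made for the example, not something the approach requires.

```python
from collections import defaultdict

class PromotionTracker:
    """Tracks zero-shot runs per layout and flags layouts worth caching."""

    def __init__(self, min_runs: int = 50, min_confidence: float = 0.95):
        self.min_runs = min_runs
        self.min_confidence = min_confidence
        self._streak: dict[str, int] = defaultdict(int)

    def record(self, layout_id: str, lowest_field_confidence: float) -> bool:
        """Record one run; return True once the layout is eligible for promotion.

        A single low-confidence run resets the streak, so only layouts the
        model handles consistently are promoted.
        """
        if lowest_field_confidence >= self.min_confidence:
            self._streak[layout_id] += 1
        else:
            self._streak[layout_id] = 0
        return self._streak[layout_id] >= self.min_runs

tracker = PromotionTracker(min_runs=3)
for confidence in (0.99, 0.97, 0.98):
    eligible = tracker.record("layout_a", confidence)
print("layout_a eligible for a cached profile:", eligible)
```

Generating the deterministic profile from those runs is the harder part and is not shown here.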

Gate both on per-field confidence, verify the rest

Whichever path produced a field, the question downstream is the same: how sure are we, and does it pass validation. High-confidence, validated fields clear straight through. Low-confidence ones, from either method, route to a human, and that exception margin is where the work concentrates. A ReAct-style verification loop, in which the system re-reads the source and checks the extracted value against it before accepting it, lets a field be re-checked rather than trusted blind, and it is what makes the quiet-failure row of the table survivable.
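A minimal sketch of that gate, assuming each field arrives with a confidence score and a validation flag whichever engine produced it. The `ExtractedField` shape and the 0.9 threshold are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to be calibrated, from 0.0 to 1.0
    validated: bool    # passed type and cross-field checks
    method: str        # "template" or "zero_shot"; not used by the gate

def gate(fields: list[ExtractedField], threshold: float = 0.9):
    """Split fields into those that auto-clear and those that need a human."""
    cleared, review = [], []
    for f in fields:
        if f.validated and f.confidence >= threshold:
            cleared.append(f)
        else:
            review.append(f)
    return cleared, review

fields = [
    ExtractedField("total", "1234.50", 0.98, True, "template"),
    ExtractedField("invoice_date", "2024-03-01", 0.71, True, "zero_shot"),
    ExtractedField("po_number", "PO-88", 0.96, False, "zero_shot"),
]
cleared, review = gate(fields)
print([f.name for f in cleared], [f.name for f in review])
```

The gate deliberately ignores `method`: the decision rests on evidence about the field, not on which path produced it.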

The failure mode that bites each one

Both approaches fail in production. Knowing the shape of each failure is most of the operational maturity.

Templates fail silently to drift. A supplier reflows their invoice, moves the total, changes a label — and a template that matched on position now reads the wrong number or nothing. The danger is that this happens to one layout among hundreds, so the aggregate accuracy barely moves and nobody notices until a downstream reconciliation breaks weeks later. The defence is monitoring per-class extraction health, not a global accuracy number that hides the one class that fell over.
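The gap between a global number and per-class health is easy to show with synthetic data. The sketch below uses field fill rate as the health signal, which is an assumption: it catches a template that returns nothing, but not one that returns the wrong number. The class names and the 0.9 floor are invented for the example.

```python
from collections import defaultdict
from typing import Iterable

def fill_rates(records: Iterable[tuple[str, bool]]) -> tuple[float, dict[str, float]]:
    """Return (global fill rate, fill rate per document class).

    Each record is (doc_class, field_was_filled).
    """
    seen: dict[str, int] = defaultdict(int)
    filled: dict[str, int] = defaultdict(int)
    for doc_class, was_filled in records:
        seen[doc_class] += 1
        filled[doc_class] += int(was_filled)
    total = sum(seen.values())
    if total == 0:
        raise ValueError("no records to score")
    per_class = {c: filled[c] / seen[c] for c in seen}
    return sum(filled.values()) / total, per_class

def unhealthy_classes(per_class: dict[str, float], floor: float = 0.9) -> list[str]:
    """Classes whose fill rate has dropped below the floor."""
    return sorted(c for c, rate in per_class.items() if rate < floor)

# Synthetic mix: 99 healthy layouts and one whose template broke completely.
records = [(f"vendor_{i}", True) for i in range(99) for _ in range(100)]
records += [("vendor_99", False)] * 100

global_rate, per_class = fill_rates(records)
print(f"global fill rate: {global_rate:.2%}")
print("classes needing attention:", unhealthy_classes(per_class))
```

One layout out of a hundred can fail completely while the global figure moves by a single point, which is why the per-class view is the one worth alerting on.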

Zero-shot fails confidently to plausibility. The model returns a value that is the right type and a believable magnitude but simply wrong — a transposed figure, a date from the wrong line, the second total instead of the first. It passes a naïve schema check precisely because it is well-formed. The defence is per-field confidence calibrated against held-out ground truth, cross-field validation (do the line items sum to the total?), and a verification pass that reads the value back against the document — never "the JSON parsed, ship it".
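The line-items check mentioned above is the cheapest cross-field validation to add. The sketch below assumes amounts have already been parsed to `Decimal` and allows a one-cent rounding tolerance; both the helper name and the tolerance are illustrative.

```python
from decimal import Decimal

def line_items_match_total(line_items: list[Decimal], total: Decimal,
                           tolerance: Decimal = Decimal("0.01")) -> bool:
    """True when the line items sum to the stated total, within tolerance."""
    return abs(sum(line_items, Decimal("0")) - total) <= tolerance

items = [Decimal("100.00"), Decimal("34.50")]

# A transposed total is well-typed and plausible, so a shape check accepts it.
transposed_total = Decimal("143.50")
print(isinstance(transposed_total, Decimal))              # the shape check
print(line_items_match_total(items, transposed_total))    # the cross-field check
print(line_items_match_total(items, Decimal("134.50")))
```

This only helps on documents that carry redundant figures. Fields with nothing to cross-check still need confidence scoring and a read-back against the source.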

Notice the symmetry: the template's failure is loud but easy to miss in aggregate; the model's failure is quiet but catchable with the right instrumentation. Neither is a reason to avoid the approach. Both are reasons to measure the thing that actually breaks instead of the thing that is easy to chart.

What this means for the document layer

If there is one practical takeaway, it is that the template-vs-zero-shot question is the wrong altitude to decide a platform on. The right question is whether the system can run both, route between them per document, and report the things that make either one trustworthy. Three properties matter more than the toggle:

Schema-driven, layout-optional. You define the fields and types you want once; the platform decides per document whether a template or zero-shot is the cheaper way to fill that schema. The schema is the contract; the extraction method is an implementation detail the system optimises, not a fork the customer has to choose and maintain.

Confidence and provenance on every field, regardless of path. A field extracted by a template and a field extracted zero-shot should arrive with the same metadata — a calibrated confidence and a pointer to where in the source it came from — so the decision to auto-clear or escalate is made on evidence, not on which engine happened to run. Without this, the quiet-failure row is a time bomb.

An audit trail that reconstructs the decision. Non-determinism is fine in production as long as it is accountable. A per-field audit trail — what was read, by which method, with what confidence, against which span of the document — is what turns "the model guessed" into "here is exactly why this value was chosen", which is the difference between a demo and something a regulated buyer signs off.
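As a sketch of what such a per-field record might hold, the dataclass below captures the four things named above and serialises them to one JSON log line. The field names, the character-offset span and the `to_log_line` helper are assumptions for illustration; a real system would likely also record model and prompt versions.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class FieldAudit:
    field: str             # schema field name
    value: str             # what was read
    method: str            # "template" or "zero_shot"
    confidence: float      # per-field confidence at decision time
    page: int              # 1-based page in the source document
    span: tuple[int, int]  # character offsets of the value on that page
    decision: str          # "auto_cleared" or "sent_to_review"

def to_log_line(record: FieldAudit) -> str:
    """Serialise one record as a stable, sorted JSON line."""
    return json.dumps(asdict(record), sort_keys=True)

record = FieldAudit("total", "1234.50", "zero_shot", 0.97, 1, (212, 219), "auto_cleared")
print(to_log_line(record))
```

Because the record has the same shape for both methods, one query answers "why was this value accepted" regardless of which path produced it.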

Closing thought

Template-based and zero-shot extraction are not competitors so much as two ends of one dial, and the entire skill is knowing where to set it for a given document. Template the fat, stable, high-volume head where determinism and cost dominate. Send the long, varied, semantic tail to zero-shot where coverage and time-to-value dominate. Route between them on recognition and confidence, let hot layouts graduate from the model to a cheap cache, and gate everything on per-field confidence with a verification step so the model's quiet failures and the template's silent drift both get caught. The teams that argue about which approach is "the future" tend to ship neither well. The ones that win treat it as a routing problem and spend their energy on the measurement that keeps both halves honest.

At Cogneris we build the document layer around the schema, not the engine: you specify the fields, and the platform routes each document to the method that clears it most cheaply at your accuracy bar — deterministic where a layout is known and stable, zero-shot where it isn't — with calibrated per-field confidence, provenance, and an audit trail on every value regardless of how it was read. If you want to see where templates versus zero-shot actually land on your own document mix, talk to our team and bring one messy folder — we will show you the routed answer on real pages.
