The token tax in document AI

The number that retired "it's software, so 80% margin"

For two decades the most reliable fact about a software business was its gross margin. You sold the same product to the next customer at roughly zero marginal cost, and so the textbook number settled in the 75–85% range — high enough that the rest of the model, the revenue multiple included, was built on top of it as an assumption nobody bothered to restate. The agentic cycle quietly removed the assumption. When the product is an agent that reasons before it acts, serving the next unit of work is not free: every decision burns tokens, and those tokens are a bill from a frontier provider that arrives every month. The most useful way to name what that does to the model is a tax — a token tax — because it is levied on the thing you sell, scales with usage, and cannot be waved away by selling harder.

The size of it is what reorganised board decks this year. A business that would have reported 80% gross margin as classic SaaS reports something in the 50–60% band once inference is counted honestly, and the gap is not a rounding error — it is on the order of 30 points of margin that simply is not there. The precise figure varies by workload and is argued about, as every headline number is, but the direction is not in dispute by anyone who has read an agentic company's actual cost of revenue. The interesting question is not "is the margin lower" — the close already says it is — but "what does a 30-point structural gap change", and the answer reaches further up the model than most teams expect.

A token tax is not an operating expense you optimise next quarter. It is a cost of goods sold you carry from the first contract. SaaS earned its multiple on a margin agentic software does not start with.

Why inference is COGS, not an expense you can wave away

The first mistake is filing inference under the wrong line. A lot of early agentic P&Ls treated model spend as an R&D or infrastructure cost — a fixed bet on building the product — because that is where cloud compute lived in the SaaS era. Inference does not behave that way. It is variable, it is per-transaction, and it rises with exactly the thing you are billing for. That is the textbook definition of cost of goods sold, and putting it anywhere else does not make it cheaper; it only makes the gross margin on the slide a fiction that the first audited close corrects in public.

This is the same physics we wrote about from the supply side in the compute moat and the inference-margin question — the cost of serving an intelligent output is a real, recurring line, not a one-time build. The good news, such as it is, is that the input price keeps moving in your favour: the price of frontier inference keeps resetting downward, which lifts the floor over time. The bad news is that the same period made models that think before they answer the default for hard work, and a reasoning call costs several times a classic completion. Falling unit price and rising tokens-per-task are pulling in opposite directions, and the net is a margin that is structurally lower than SaaS and noisier quarter to quarter. You manage that. You do not assume it away.

Where the token tax actually comes from

The tax is not one charge; it is a stack of them, and naming the layers is the start of controlling them. A classic SaaS request returns a row from a database. An agentic request runs a loop: it reads the document, plans, calls the model to reason, calls it again to check itself, retrieves context, and often does the whole circuit more than once before it is confident enough to act. Each turn of that loop is a metered call, and autonomy is precisely the dial that adds turns. The table below is the same unit of work — "handle one document" — priced as a SaaS feature and as an agentic flow, so the gap is visible rather than asserted.

Cost line	Classic SaaS	Agentic flow
Compute per request	A query against a database — fractions of a cent, effectively fixed.	Several metered model calls per task, variable and rising with autonomy.
Reasoning tokens	None. There is nothing to "think".	The test-time compute the model spends before it answers — invisible on the output, very visible on the bill.
The loop multiplier	One request, one response.	Plan, act, observe, retry — the ReAct circuit can run the model 10–20× for one decision.
Retry & verification	Cached, near-free.	Self-checks and re-runs on low confidence — necessary for quality, paid for in tokens.
Gross margin it implies	75–85%.	50–60%, before any of it is optimised back.

Read down the right column and the lesson writes itself: the token tax is the price of the autonomy that makes the product worth buying. You cannot delete it without deleting the value, which is why "just use a cheaper model everywhere" is not the answer. The answer is to spend the expensive tokens only where they change the outcome — which means knowing, per task, whether the difficulty justifies the bill.

The valuation mistake: a SaaS multiple on an agent's margin

The reason this is a board topic and not a FinOps footnote is that margin is an input to valuation, and the wrong margin assumption misprices the whole asset. A revenue multiple is, underneath, a bet on the cash a dollar of revenue eventually throws off — and a dollar of revenue at 55% gross margin throws off materially less than a dollar at 80%. Apply the SaaS multiple, built on the SaaS margin, to an agentic business carrying the token tax, and you have overpriced it by roughly the gap between those two numbers, compounded by every other multiple that sits downstream of margin.

This is distinct from two adjacent conversations it gets confused with, and the distinction is the whole point. It is not outcome-based pricing, which is a question about the revenue line — how you charge. And it is not the tactical levers of token economics, which are about shaving the bill after the fact. The token-tax thesis is structural: the unit economics of work-delivered-as-software have a different margin floor than software-delivered-as-software, and that floor has to be modelled from the pricing up, not discovered at the first guidance revision. A founder who hides the tax from the board until the numbers force it into the open does not get the benefit of the doubt twice.

The levers that recover margin — in points, not vibes

None of this means the 50–60% floor is a life sentence. It means margin recovery is an engineering discipline with a number attached, not a slogan, and the levers that matter are the ones you can quantify in points of gross margin rather than gesture at. Four do most of the work:

Right model by task difficulty — route the easy, high-volume work to a small fast model and reserve the frontier reasoning model for the cases that genuinely need it. Most document work is not hard; paying frontier prices for trivial classification is the single most common way the bill triples for no quality gain.
Prompt caching — the shared context an agent re-sends on every call (the schema, the instructions, the stable corpus) is the cheapest token in the stack when it is cached instead of re-billed. On a high-volume flow this is points of margin, not a rounding error.
Distillation to a smaller model — once a frontier model has shown how a specific task should be done, a smaller model trained on that behaviour holds most of the accuracy at a fraction of the cost for that one task. The expensive model becomes the teacher, not the runtime.
Open-weight for the commodity layer — the interchangeable, predictable parts of the workload can run on an open-weight model you operate, with the frontier API kept for the edge cases. The commodity work is where the volume — and therefore the bill — concentrates.

Each of these is a line item in a margin-recovery plan, and a serious one states the expected gain in points and measures whether it landed. The pattern that fails is the opposite: a vague intention to "optimise costs later" with no instrumentation to say whether last month's changes moved the margin at all. You cannot manage a tax you do not measure.

The token tax is not paid once. It is paid per decision, forever — so the lever that matters is not a one-off cut but a cost per decision you watch the way SaaS watched cost per seat.

Margin per decision, alongside cost per decision

The metric that survives this shift is not the one most dashboards report. Plenty of teams now track cost per decision — the all-in token spend to handle one document, one invoice, one claim — and that is the right denominator. But cost alone tells you the bill, not the business. The number that tells you whether the business works is margin per decision: what you charge for that unit of work minus what it cost you to produce, including the reasoning tokens that never show up on the output. A flow can have a falling cost per decision and a collapsing margin per decision at the same time, if the price you charge is falling faster than the cost — which is exactly what happens when a vendor competes on price without modelling the floor underneath.

Reported per product and per flow, margin per decision is the instrument that keeps the token tax visible at the altitude where it gets decided. It turns the abstract "our margin is structurally lower" into "this specific flow is below floor at this price, fix the routing or change the price". That is the operational other half of an AI operating model that survives contact with a real close: not just shipping agents, but knowing the margin each one runs on, the way the last era knew its cost per seat.

The four ways the token tax sinks a business

Even teams that accept the thesis watch it fail in recognisable ways. Naming them is most of the defence:

The flat unlimited contract over a variable cost — selling an all-you-can-use plan priced on a SaaS instinct, on top of a per-call inference bill that the heaviest customers turn into a loss. The biggest accounts become the least profitable, and the model only discovers it at scale, when it is hardest to reprice.
The SaaS multiple on the agentic business — the board, or the buyer, values the company on a revenue multiple calibrated for 80% margin and is surprised by the cash conversion of a 55%-margin business. The misprice is structural, so it does not self-correct; it gets corrected, publicly, at the first close.
Reasoning switched on globally — flipping the model into "thinking" mode for every request because it lifts quality on the hard cases, and watching the bill triple on the easy ones that never needed it. Autonomy and reasoning are dials, not switches; left on globally they are the fastest way to burn the margin.
The tax hidden from the board — keeping the token cost inside engineering's dashboard and out of the P&L until a guidance revision forces it into the open. The cost was never the real problem; the surprise was. A board that learns about a structural margin gap from a miss stops trusting the deck.

The trade-off, stated honestly

There is a genuine tension here, and pretending otherwise is how the thesis gets oversold in the other direction. More autonomy and more reasoning generally buy more quality and more of the work done without a human — and they cost more tokens, which is to say more of the margin floor. Less reasoning buys margin and costs you on the hard cases the cheap model gets wrong. Neither end of the dial is free, and the honest framing is not "minimise tokens" — a vendor that starves the agent to protect margin ships a worse product — but "spend tokens where they change the result and nowhere else".

For most document workflows the binding constraint is that the difficulty is wildly uneven: a large majority of documents are routine and a small minority are genuinely hard, and the margin is made or lost on whether the system can tell them apart before it decides how hard to think. So the default leans toward cheap, fast handling for the bulk, with the expensive reasoning reserved — by a router that estimates difficulty first — for the cases that earn it. The exception is real: a workflow where almost every document is adversarial or multi-document can rationally run the expensive path more often, and should price for it. Spend the token where it moves the outcome. The mistake is paying the tax flat, on every document, as if they were all the hard one.

What this means for the document layer

Document work is where the token tax is easiest to underestimate, because a prototype hides it perfectly. An engineer can wire an extraction demo that reads an invoice in an afternoon, running the frontier reasoning model on every field because it is one document and the bill is invisible. The same design at a million documents a month is a margin emergency. Three properties separate a document platform that carries the tax well from one that lets it compound:

It routes by difficulty, not by default. A platform that runs the expensive model on the routine 90% is paying frontier prices for work a small model closes, and the margin shows it. The extraction-to-decision path should estimate how hard a document is before it decides how hard to think — cheap and fast for the bulk, reasoning held for the cases that need it.

It makes the token cost legible per flow. You cannot hold a margin you cannot see. Cost and margin per decision, reported per tenant and per flow, are what turn the token tax from a quarterly surprise into a number you steer — the difference between managing the floor and being managed by it.

It pays for the audit trail it already needs. A per-field audit trail — what was read, by which model, at what confidence, against which span of the source — is not only a governance artefact. It is the same telemetry that tells you which calls were worth their tokens, which is why a system that already keeps the record gets margin instrumentation for free.

Closing thought

The board conversation that ages well in 2026 is not "is AI a software business". It is "what is the margin floor of this particular software business, and have we modelled it from the price up". Agentic software is real software with real leverage — but it starts roughly 30 points of gross margin below the SaaS the multiple was built on, because the token tax is a cost of goods sold, not an expense to optimise someday. A company that prices for the floor, routes its tokens by difficulty, and reports margin per decision keeps the leverage without the surprise. A company that books a SaaS margin it does not have, sells flat plans over a variable cost, and hides the tax until the close gets neither. The metric that predicts which one you are is not the growth rate on the cover slide — it is whether the deck states the gross margin honestly, and whether the number underneath it is one the business can actually hold.

At Cogneris we build the document layer to carry the token tax instead of pretending it away: the model is a rented, swappable implementation detail the platform routes by difficulty — small and fast for the routine, ReAct reasoning reserved for the documents that earn it — so the expensive tokens are spent where they change the result, not flat on every page. Cost and margin per decision are reported per tenant out of the box, the per-field audit trail doubles as the margin instrumentation, and our pay-per-page pricing is built on the real cost floor rather than a SaaS margin we would have to claw back at the first close. If you are modelling the unit economics of a document workflow and want to see the token tax on your own documents rather than in the abstract, talk to our team and bring the flow — we will show you the cost and the margin per decision before you sign anything.

It's software. The margin disagrees.