Agent washing: real agents vs RPA

The number that named the bluff

For two years "agentic" was the word that opened budgets. It is now also the word that closes them. The most-repeated forecast of the cycle is that more than 40% of agentic-AI projects will be cancelled by the end of 2027 — not paused, cancelled — drawn from a survey of more than 3,400 organisations that have already spent on the technology. The three reasons given are unglamorous and familiar: cost that climbed faster than anyone modelled, business value too diffuse to defend at renewal, and risk control that never matched what the system could touch. The precise percentage will be argued about, as every headline number is. The gap it describes will not: a large slice of what got funded as "agents" is about to be quietly switched off.

The forecast did one useful thing — it forced a vocabulary on a market that had been getting away without one. The aggravating factor now has a name: agent washing. It is the practice of taking an assistant, a chatbot, a rules engine or an RPA script and relabelling it an "agent" because the label sells, while the thing underneath cannot do the one thing the word is supposed to mean. Of the thousands of vendors describing themselves as agentic, only a fraction ship a system that decides and acts on a goal rather than executing a fixed sequence. For the board, the disillusionment phase is not a reason to retreat from agentic AI. It is a reason to get precise about what was actually bought.

The demo is designed to survive the room. The system has to survive production, the renewal, and the first incident. Those are different tests, and only one of them is on the slide.

What "agent washing" actually is

Strip the marketing and there is a single technical question underneath the whole category: when the context changes, does the system change what it does? A real agent, given a goal, perceives state, decides a next action, acts, observes the result and decides again — a loop, not a line. Rebranded RPA follows a branch it was told about in advance; when reality steps outside the branch, it does not replan, it fails or it does the wrong thing confidently. A chatbot with a nice tone answers; it does not own an outcome. None of these are useless — a deterministic rules engine is often the right tool — but calling them agents hides exactly the property a board is paying a premium for.

The tell is almost never visible in the demo, because a demo is a rehearsed happy path where the context never changes. It shows up the first time a document arrives in a layout nobody scripted, a counterparty behaves off-pattern, or an exception lands that the decision tree never enumerated. That is the moment that separates the ReAct-style loop that replans from the flowchart that derails. So the first diligence move is not to watch the agent succeed. It is to break the happy path on purpose and watch what the system does next.

The claim	What rebranded RPA does	What a real agent does
"It's autonomous"	Runs a fixed branch. Off-script input fails or mis-fires with no signal.	Replans when context changes; chooses a different action toward the same goal.
"It decides"	Returns a pre-mapped answer for pre-mapped inputs.	Weighs evidence, scores confidence, and routes low-confidence cases to a human.
"It handles exceptions"	Exceptions are whatever someone enumerated in advance.	Recognises the unenumerated case and escalates instead of guessing.
"It learns"	Static until a developer edits the rules.	Every correction feeds back and measurably lifts the next run.
"It's safe"	Safety is implicit because the script is narrow.	Guardrail, kill switch and human-in-the-loop sized to what it can touch.

Read down the right column and you have the actual specification of an agent worth its premium. Read down the middle and you have a perfectly respectable automation that should be priced and governed as what it is. The fraud is not the technology in either column; it is putting the middle column behind the right-column word and the right-column invoice.
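The loop-versus-line distinction in this section can be made concrete with a toy contrast. The sketch below is an illustration under loud assumptions, not a real agent: the field names, the `Outcome` type and both functions are hypothetical, and the "replanning" is a bounded try-observe-retry over a short list, where a real system would have a model choose the next action. What it does show is the behaviour to look for when the happy path is broken on purpose: the script fails, while the loop either reaches the goal another way or hands the case off.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical field names a total might appear under. A real agent would
# have a model propose the next action instead of walking a fixed list.
CANDIDATE_KEYS = ("invoice_total", "total_due", "amount_payable")


@dataclass
class Outcome:
    status: str            # "done" or "escalated"
    value: Optional[str]   # extracted value, or None when escalated
    steps: int             # how many attempts the loop used


def scripted_extract(doc: dict) -> str:
    """The line: reads the one key it was told about, raises KeyError otherwise."""
    return doc["invoice_total"]


def looping_extract(doc: dict, max_steps: int = 5) -> Outcome:
    """The loop: act, observe, try another action, escalate when out of options."""
    steps = 0
    for key in CANDIDATE_KEYS[:max_steps]:
        steps += 1
        value = doc.get(key)          # act, then observe the result
        if value is not None:         # goal reached
            return Outcome("done", value, steps)
    return Outcome("escalated", None, steps)  # unenumerated case: hand off


off_script = {"total_due": "120.00"}  # a layout nobody templated
print(looping_extract(off_script).status)                   # done
print(looping_extract({"notes": "no total here"}).status)   # escalated
```

Calling `scripted_extract(off_script)` raises `KeyError`, which is the "fails with no replan" row of the table in miniature.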

Autonomy you can measure, not autonomy you're told about

"Autonomous" is the word that does the most lifting in an agentic pitch and gets the least scrutiny, because it sounds like a property of the model when it is actually a property you have to test. The measurable version is concrete: take a representative sample of real cases, including the messy tail, and check what fraction the system carries end to end without a human touching it — and, just as important, whether the cases it does not finish are the ones it correctly handed off rather than the ones it got wrong silently. A system with a high finish rate and an honest escalation rate is autonomous. A system with a high finish rate and no escalation is not autonomous; it is overconfident, and the difference will surface as a loss.

This is why "how many agents do you have" is the wrong maturity question and "how many flows reach a goal without a human, at what error rate" is the right one. The first counts deployments; the second counts autonomy. A board that grades the programme on agent headcount is measuring the same vanity metric that stalls pilots before production — activity that photographs well and predicts nothing. Ask for the autonomy number on real cases, ask to see the escalations, and the agent-washed system has nowhere to hide.
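The autonomy number described above is simple enough to compute by hand, which is part of its appeal. A minimal sketch, assuming each sampled case has been reviewed and labelled with three hypothetical flags (`finished_by_system`, `escalated`, `correct`); the names and the sample are illustrative, not a standard metric definition:

```python
from dataclasses import dataclass


@dataclass
class Case:
    finished_by_system: bool  # carried end to end with no human touch
    escalated: bool           # handed off to a human by the system itself
    correct: bool             # final outcome matched the reviewed ground truth


def autonomy_report(cases: list[Case]) -> dict[str, float]:
    """Finish rate, escalation rate and silent-error rate over a labelled sample."""
    n = len(cases)
    if n == 0:
        raise ValueError("need at least one labelled case")
    finished = [c for c in cases if c.finished_by_system]
    silent_errors = [c for c in finished if not c.correct]  # finished, but wrong
    escalated = [c for c in cases if c.escalated]
    return {
        "finish_rate": len(finished) / n,
        "escalation_rate": len(escalated) / n,
        "silent_error_rate": len(silent_errors) / n,
    }


# Hypothetical sample of ten real cases, including the messy tail.
sample = (
    [Case(True, False, True)] * 7      # finished and right
    + [Case(True, False, False)] * 1   # finished and silently wrong
    + [Case(False, True, True)] * 2    # correctly handed off
)
print(autonomy_report(sample))
```

A high `finish_rate` with an `escalation_rate` of zero and a non-zero `silent_error_rate` is the overconfident profile; the same finish rate with honest escalations is the autonomous one.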

ROI per flow, not ROI for the programme

The second thing the 40% have in common is that nobody could say what any single flow was worth. "The agentic programme" is not a unit you can defend at renewal; a flow is. The discipline that keeps a project out of the write-off column is to attribute value one flow at a time — this invoice-matching flow removed this many review-minutes at this error rate against this baseline — and to make that number, measured in production, the gate for whether the flow continues. Programme-level ROI is where unprofitable flows hide inside the average until the day the average itself gets cut.

This is the same gap we wrote about as the ROI gap: the money was spent, the demo worked, and no instrumented number ever proved a flow paid back. Outcome attribution per flow is not a finance nicety bolted on afterwards — it is the thing that lets you kill the three flows that lose money and double the two that don't, instead of cancelling all five because the blended line looked weak. A POC with no defined cancellation criterion is not a pilot; it is an open-ended subscription to optimism. Write the kill criterion before the spend, per flow, and the programme stops being a single bet you either keep or write off whole.

A programme you can only evaluate as a whole is a programme you will eventually cancel as a whole. The flow is the unit that survives diligence, because the flow is the unit that has a number.
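The per-flow discipline can be sketched in a few lines. The example below is illustrative only: the flow names, the monthly figures and the `min_net_value` threshold are all hypothetical, and in practice the value side would come from instrumented production measurements against a baseline. It shows how a blended line that looks weak can hide two flows worth doubling and three worth killing.

```python
from dataclasses import dataclass


@dataclass
class FlowResult:
    name: str
    monthly_value: int   # measured value against baseline, in currency units
    monthly_cost: int    # fully loaded cost of running the flow
    min_net_value: int   # kill criterion, written before the spend

    @property
    def net(self) -> int:
        return self.monthly_value - self.monthly_cost


def review(flows: list[FlowResult]) -> tuple[list[str], list[str], int]:
    """Return (flows to keep, flows to kill, blended net for the programme)."""
    keep = [f.name for f in flows if f.net >= f.min_net_value]
    kill = [f.name for f in flows if f.net < f.min_net_value]
    blended = sum(f.net for f in flows)
    return keep, kill, blended


# Hypothetical programme of five flows, each with a break-even kill criterion.
flows = [
    FlowResult("invoice_matching", 9000, 4000, 0),
    FlowResult("po_intake", 7000, 3000, 0),
    FlowResult("contract_triage", 2000, 5000, 0),
    FlowResult("claims_review", 1000, 4000, 0),
    FlowResult("email_routing", 500, 2500, 0),
]
keep, kill, blended = review(flows)
print(keep, kill, blended)
```

With these made-up numbers the blended net is 1,000 a month, thin enough to invite cancelling the whole programme, while the two kept flows alone net 9,000.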

Risk control sized to the blast radius

The third reason projects die is the one boards are least comfortable pricing: a system was given autonomy out of proportion to what it could damage. An agent that drafts an internal summary and an agent that approves a payment are not the same risk, and governing them identically is wrong in both directions — it either smothers the harmless one in approvals or lets the dangerous one act without a brake. The control has to be proportional to the blast radius: how much can this agent affect, how reversible is it, and how fast would anyone notice if it went wrong.

Proportional control is a design, not a disclaimer, and it has concrete parts: a guardrail that bounds the actions the agent may take at all, a kill switch that stops it within one action when something looks wrong, and a human-in-the-loop placed where the cost of a mistake — not the volume of cases — justifies the friction. We have argued the full version of this elsewhere as proportional agent governance: low-risk, reversible, high-volume work runs with a light touch and post-hoc audit; high-risk, irreversible, low-volume work runs with a human gate and a tighter leash. The agent-washed product usually fails this test from the other side — it has no real autonomy and no real controls, because both were assumed rather than built.
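One way to make "sized to the blast radius" operational is to derive the control from the three questions above: how much can the flow affect, how reversible is it, and how fast would anyone notice. The sketch below is an assumption-laden illustration: the thresholds, the `FlowRisk` fields and the middle `SAMPLED_REVIEW` tier are hypothetical choices added for the example, not a prescribed policy.

```python
from dataclasses import dataclass
from enum import Enum


class Control(Enum):
    POST_HOC_AUDIT = "light touch, post-hoc audit"
    SAMPLED_REVIEW = "sampled human review"   # hypothetical middle tier
    HUMAN_GATE = "human approval before the action"


@dataclass
class FlowRisk:
    max_impact: float       # worst-case loss per action, in currency units
    reversible: bool        # can the action be undone after the fact?
    hours_to_detect: float  # how long before anyone would notice an error


def required_control(risk: FlowRisk,
                     impact_limit: float = 1000.0,
                     detect_limit_hours: float = 24.0) -> Control:
    """Map a flow's blast radius to the lightest control that still fits it."""
    high_impact = risk.max_impact > impact_limit
    slow_to_detect = risk.hours_to_detect > detect_limit_hours
    if high_impact and not risk.reversible:
        return Control.HUMAN_GATE
    if high_impact or not risk.reversible or slow_to_detect:
        return Control.SAMPLED_REVIEW
    return Control.POST_HOC_AUDIT


summary_draft = FlowRisk(max_impact=0.0, reversible=True, hours_to_detect=1.0)
payment_approval = FlowRisk(max_impact=50000.0, reversible=False, hours_to_detect=48.0)
print(required_control(summary_draft).value)
print(required_control(payment_approval).value)
```

The point of writing it down as a function is that the same inputs always produce the same control, so the harmless flow is not smothered and the dangerous one cannot be waved through.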

The total cost that never makes the slide

The cost line on an agentic business case is almost always the token price of a single model call, and that is the number least connected to the bill. A real agent reasons in a loop: it calls the model repeatedly within one task (perceive, decide, act, observe, decide again), so a task is not one call but ten or twenty, and the cost of that test-time reasoning is a multiple of the cost of a classic single-shot call. The flows that quietly go negative are the ones where the loop runs hot on hard documents and nobody put a ceiling on the steps. The falling price of inference helps, but it does not save a system that reasons without a budget.

An honest total cost of ownership includes the loop, not just the call: the reasoning tokens per task, the cost of every exception that still goes to a human, the integration and maintenance that keep the thing wired to the system of record, and the ongoing evaluation that proves it has not drifted. A business case built on the single-call price and the demo's success rate is the precise input that produces a write-off eighteen months later, when the real bill arrives and the attributable value never did. Put the loop on the slide. A vendor that cannot give you cost-per-decision including reasoning is telling you they have not measured the number that decides whether the flow survives.
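The gap between the slide price and the loaded price is plain arithmetic. A minimal sketch, with every number hypothetical (token counts, price per thousand tokens, escalation rate, human handling cost, fixed monthly cost and volume are all assumptions chosen for illustration):

```python
def cost_per_decision(calls_per_task: float,
                      tokens_per_call: float,
                      price_per_1k_tokens: float,
                      escalation_rate: float,
                      human_cost_per_escalation: float,
                      fixed_monthly_cost: float,
                      tasks_per_month: float) -> float:
    """Loaded cost of one task: reasoning loop + human exceptions + fixed overhead."""
    model_cost = calls_per_task * tokens_per_call / 1000 * price_per_1k_tokens
    human_cost = escalation_rate * human_cost_per_escalation
    overhead = fixed_monthly_cost / tasks_per_month  # integration, maintenance, evaluation
    return model_cost + human_cost + overhead


# The slide: one model call, no exceptions, no overhead.
slide = cost_per_decision(1, 4000, 0.01, 0.0, 0.0, 0.0, 10_000)

# The bill: twenty calls in the loop, 10% of cases escalated at 5.00 each,
# and 2,000 a month of fixed cost spread over 10,000 tasks.
loaded = cost_per_decision(20, 4000, 0.01, 0.10, 5.00, 2000.0, 10_000)

print(f"slide: {slide:.2f}  loaded: {loaded:.2f}")  # slide: 0.04  loaded: 1.50
```

Under these assumptions the loaded figure is well over thirty times the slide figure, and only the loaded figure can be compared against the value a flow actually returns.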

The four ways a board ends up in the 40%

The cancellations are not random; they cluster around four decisions made before a line of the system ran. Naming them is most of the defence:

Buying the word, not the property — the "agent" claim is taken at face value off a happy-path demo, and nobody breaks the path to check whether the system replans or just follows a tree. The autonomy was the premium; it was also the thing never verified.
Funding a POC with no kill criterion — the pilot starts with a budget and no definition of what failure looks like, so it cannot fail, so it runs until someone cancels it from fatigue rather than from a number. An open-ended pilot is a slow write-off.
Grading maturity by agent count — the programme is measured by how many agents were deployed instead of how many flows reached a goal in production at an acceptable error rate. The dashboard looks busy and the value line stays flat.
Ignoring the loop in the cost model — the business case priced one model call, the system reasons in twenty, and the inference bill on the hard tail turns a flow that looked profitable into one that quietly bleeds. The cost nobody modelled is the cost that cancels the project.

The trade-off, stated honestly

There is a real tension here, and pretending otherwise is how the next over-sold cycle starts. Genuine autonomy is what makes an agent worth more than an automation — it absorbs the long tail of cases you could never script — and it is also what makes the system less predictable and harder to cost, because a thing that decides for itself can decide in ways you did not enumerate. Rebranded RPA is the opposite bargain: cheap, predictable, fully auditable, and brittle the instant reality leaves the script. Neither is free, and the honest question is not "which is better" but "how much autonomy does this specific flow actually need, and can I afford the unpredictability it buys".

For a lot of work the answer is genuinely "not much" — a stable, high-volume, well-understood flow is often better served by a deterministic rule than by a reasoning loop, and saying so is not anti-agentic, it is honest sizing. The flows that justify real autonomy are the ones with a fat tail of formats and exceptions that no rule set ever catches up to, where the cost of the cases a script drops is higher than the cost of the loop that handles them. The board filter is to pay for autonomy exactly where the tail justifies it, run the predictable middle on cheaper deterministic rails, and refuse to pay the agent premium for either a script wearing the word or a loop running where a rule would do. Get that allocation right and you are in the 60% by construction, not by luck.

What this means for the document desk

Document work is where agent washing is easiest to sell and easiest to expose, because document tasks look simple from the demo and hide their whole difficulty in the tail. Anyone can show an agent reading a clean invoice; the question that separates a real agent from a relabelled OCR-plus-rules pipeline is what happens to the invoice in a layout nobody templated, in a second language, with a line item that does not reconcile. Three properties tell them apart on real documents:

It replans on the document it has never seen. A real document agent does not fail when the format drifts; it reads, notices the fields it expected are not where it expected them, and works the problem rather than returning a confident wrong answer. That is the extraction-to-decision loop doing its job — and the single cleanest test of whether the word "agent" was earned.

It escalates instead of guessing. The moment a document agent is more valuable than a script is the moment it knows what it does not know — a low-confidence field, an exception outside its competence — and routes that case to a human instead of writing a plausible value into a ledger. A system with no escalation rate is not more autonomous; it is less honest.

It leaves an audit trail you can read without the vendor. A per-field audit trail — what was read, by which method, at what confidence, against which span of the source — is what lets you prove the autonomy is real rather than asserted, hold a quality SLA, and answer a regulator's "how do you know" with a record instead of a marketing claim. Agent washing cannot survive a trail; that is exactly why the washed product never has one.
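The last two properties, escalation and the per-field trail, fit naturally in one record. The sketch below assumes a hypothetical `FieldAudit` shape covering the four things named above (what was read, by which method, at what confidence, against which span of the source) and a hypothetical confidence threshold of 0.9; the field names, method labels and threshold are illustrative, not a description of any particular product.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FieldAudit:
    field_name: str                     # which field was read
    value: str                          # what was read
    method: str                         # how it was read; labels are hypothetical
    confidence: float                   # 0.0 to 1.0
    source_span: tuple[int, int, int]   # (page, start_char, end_char) in the source


def route(record: FieldAudit, threshold: float = 0.9) -> str:
    """Write confident fields automatically; send the rest to a human."""
    return "auto" if record.confidence >= threshold else "human"


def to_log_line(record: FieldAudit) -> str:
    """Serialise to plain JSON so the trail is readable without the vendor."""
    return json.dumps(asdict(record), sort_keys=True)


total = FieldAudit("invoice_total", "120.00", "model", 0.97, (1, 120, 126))
vat_id = FieldAudit("vat_id", "DE12345", "ocr+rule", 0.62, (1, 40, 47))

for rec in (total, vat_id):
    print(route(rec), to_log_line(rec))
```

Because each line is self-describing JSON, the share of records routed to `human` is the escalation rate, and any disputed value can be traced back to its span in the source document.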

Closing thought

The board conversation that ages well in 2026 is not "are we doing agentic AI" — almost everyone is, and a large fraction of it is about to be switched off. It is "is the thing we bought a real agent, and is it pointed at a flow that needs one". The filter is small and it is unforgiving: autonomy you measured on real cases rather than watched in a demo, ROI attributed per flow with a kill criterion written before the spend, risk control sized to the blast radius, and a cost model that includes the reasoning loop and the human exception. A board that applies that filter buys fewer agents and keeps more of them. A board that approves on demo enthusiasm and agent-count dashboards is, with actuarial reliability, filling the 40%.

At Cogneris we would rather lose the word than wash it. The document agent we ship is an agent where the tail justifies one — a ReAct loop that replans on documents it has never seen, escalates what it cannot resolve, and runs the predictable middle on cheaper deterministic rails so you are not paying a reasoning premium where a rule would do. The autonomy is measurable on your own cases, the value is attributable per flow, the controls are sized to what each flow can touch, and the per-field audit trail makes all of it something you can verify rather than take on faith. If you are holding an "agent" you are not sure is one, talk to our team and bring the flow — we will help you run the autonomy test on your own documents, including the cases where the honest answer is that you do not need an agent at all.

A real agent, or RPA in a new coat.