Evaluation guide

Benchmark document extraction like production.

A practical benchmark guide for comparing document AI vendors on the dimensions that decide whether an extraction workflow will survive real traffic.

Measure the whole workflow

Accuracy matters, but it is not the only metric. Teams should test citation quality, table fidelity, schema stability, latency, confidence calibration, webhook behavior, review load, and integration effort.

Build the test set

Use 25 to 100 real documents across easy, typical, and ugly cases. Include scans, photos, multi-page packets, missing fields, handwriting, checkboxes, signatures, tables, and documents that should fail validation.

Document mix

Include invoices, statements, IDs, contracts, claims, payroll and tax forms, and multi-document packet uploads.

Image quality

Test native PDFs, scanned PDFs, and mobile photos, including images with shadows, rotation, blur, and cropped edges.

Failure cases

Add missing pages, inconsistent values, duplicate files, wrong document types, and fraud signals.

Score what operators see

Track clean-through rate, fields sent to review, reviewer time, false confidence, duplicate detection, retry behavior, and the cost of explaining a wrong extraction after the fact.
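The operator-facing numbers above can be computed from a simple per-document log. The sketch below is one way to do it, not a standard definition: the `DocResult` fields are hypothetical, and it assumes a document counts as clean-through only when no field went to review and no wrong field slipped past unflagged.

```python
from dataclasses import dataclass


@dataclass
class DocResult:
    # Hypothetical per-document log record; adapt the fields to your own review tool.
    doc_id: str
    fields_total: int        # fields the schema expects for this document
    fields_in_review: int    # fields routed to a human reviewer
    reviewer_seconds: float  # time a reviewer spent on the document
    wrong_unflagged: int     # wrong fields that were NOT sent to review


def operator_metrics(results: list[DocResult]) -> dict[str, float]:
    """Summarize what operators experience across a benchmark run."""
    if not results:
        raise ValueError("need at least one document result")
    n_docs = len(results)
    total_fields = sum(r.fields_total for r in results)
    # Assumption: clean-through means no review and no silent errors.
    clean = sum(
        1 for r in results
        if r.fields_in_review == 0 and r.wrong_unflagged == 0
    )
    return {
        "clean_through_rate": clean / n_docs,
        "review_field_rate": sum(r.fields_in_review for r in results) / total_fields,
        "avg_reviewer_seconds": sum(r.reviewer_seconds for r in results) / n_docs,
        "false_confidence_rate": sum(r.wrong_unflagged for r in results) / total_fields,
    }
```

Reporting the false-confidence rate next to the clean-through rate keeps a vendor from looking good simply by never asking for review.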

Include OCR, JSON, tables, and citations

A fair benchmark separates text recognition from structured extraction. Score raw OCR text, typed JSON fields, table and line-item fidelity, source citations, validation results, and whether confidence scores actually predict review risk.
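One way to keep the two layers apart is to score OCR text and typed fields with separate functions. The sketch below is a minimal illustration under stated assumptions: `ocr_similarity` uses Python's `difflib` ratio as a rough similarity measure (it is not a formal character error rate), and `normalize` is a hypothetical rule set that treats formatting-only differences in numbers and whitespace as matches.

```python
from decimal import Decimal, InvalidOperation
from difflib import SequenceMatcher


def ocr_similarity(ocr_text: str, reference_text: str) -> float:
    """Rough similarity in [0, 1] between OCR output and a reference transcript."""
    a = " ".join(ocr_text.split())        # collapse whitespace differences
    b = " ".join(reference_text.split())
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def normalize(value):
    """Normalize a field so formatting differences do not count as errors."""
    if value is None:
        return None
    text = str(value).strip()
    try:
        # Treat "1,200.50" and 1200.5 as the same amount.
        number = Decimal(text.replace(",", ""))
        if number.is_finite():
            return number
    except InvalidOperation:
        pass
    # Fall back to case-insensitive, whitespace-collapsed text.
    return " ".join(text.lower().split())


def field_accuracy(predicted: dict, answer_key: dict) -> float:
    """Share of answer-key fields whose extracted value matches after normalization."""
    if not answer_key:
        raise ValueError("answer key is empty")
    correct = sum(
        1 for name, expected in answer_key.items()
        if normalize(predicted.get(name)) == normalize(expected)
    )
    return correct / len(answer_key)
```

A field missing from the vendor output counts as wrong here, which keeps a vendor from scoring well by returning only the easy fields.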

Benchmark dimensions

Dimension | How to test | Why it matters
Field accuracy | Compare every extracted value to a labeled answer key. | Separates impressive demos from dependable workflow data.
Citation quality | Check whether each value links to the right page, span, or box. | Reviewers need evidence, not just a number.
Table fidelity | Score rows, columns, line items, totals, merged cells, and missing rows. | AP, banking, claims, and logistics workflows depend on tables.
Schema stability | Run the same packet repeatedly and diff JSON shape and field names. | Downstream systems break when schemas drift silently.
Confidence calibration | Check whether low-confidence fields are actually more likely to be wrong. | Bad calibration creates either review overload or false automation.
Workflow behavior | Test async jobs, webhooks, retries, idempotency, and failure states. | Production volume is where API demos usually crack.
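Two rows in the table above, schema stability and confidence calibration, are easy to automate. The following is a minimal sketch, assuming each run's output is already parsed JSON and each scored field is available as a `(confidence, is_correct)` pair. The function names and the bucket edges are illustrative choices, not any vendor's API.

```python
from bisect import bisect_right


def json_type(value) -> str:
    """Name a value by its JSON type so 10 and 10.5 both count as numbers."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return type(value).__name__


def shape(value, path="$") -> set:
    """Return the set of (json_path, type_name) pairs for a parsed JSON value."""
    if isinstance(value, dict):
        pairs = {(path, "object")}
        for key, child in value.items():
            pairs |= shape(child, f"{path}.{key}")
        return pairs
    if isinstance(value, list):
        pairs = {(path, "array")}
        for child in value:
            pairs |= shape(child, f"{path}[]")
        return pairs
    return {(path, json_type(value))}


def schema_drift(runs: list) -> set:
    """Pairs seen in some runs but not all; an empty set means a stable shape."""
    if not runs:
        return set()
    shapes = [shape(run) for run in runs]
    return set.union(*shapes) - set.intersection(*shapes)


def error_rate_by_confidence(fields, edges=(0.5, 0.8, 0.95)) -> list:
    """Error rate per confidence bucket, lowest bucket first (None if empty)."""
    counts = [[0, 0] for _ in range(len(edges) + 1)]  # [total, wrong] per bucket
    for confidence, is_correct in fields:
        bucket = counts[bisect_right(edges, confidence)]
        bucket[0] += 1
        bucket[1] += 0 if is_correct else 1
    return [wrong / total if total else None for total, wrong in counts]
```

If the confidence scores are useful, the error rate should fall as the buckets rise. A flat or rising profile suggests the scores will not help with review routing.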

Use a weighted scorecard

Weight the benchmark by business risk. A low-stakes receipt workflow can optimize for speed and price. A mortgage, KYC, claims, or AP approval workflow should give more weight to citations, validation, review routing, audit trail, and false-confidence handling.

vendor,document_type,field_accuracy,citation_quality,table_fidelity,schema_stability,review_load,latency,total_score
Cogneris,invoice,0.96,0.94,0.92,0.98,0.08,2.4,0.94
Vendor A,invoice,0.97,0.71,0.88,0.82,0.19,1.8,0.84
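To see how the weighting changes the outcome, the sample rows above can be rescored under different risk profiles. This is a sketch with hypothetical weights and a hypothetical latency budget. It assumes `review_load` and `latency` are lower-is-better, that latency is in the same unit as the budget, and it does not attempt to reproduce the `total_score` column.

```python
import csv
import io

SCORECARD_CSV = """\
vendor,document_type,field_accuracy,citation_quality,table_fidelity,schema_stability,review_load,latency,total_score
Cogneris,invoice,0.96,0.94,0.92,0.98,0.08,2.4,0.94
Vendor A,invoice,0.97,0.71,0.88,0.82,0.19,1.8,0.84
"""

# Hypothetical budget: latency at or under this value earns full marks.
LATENCY_BUDGET = 2.0

# Hypothetical weight profiles; pick your own based on business risk.
HIGH_STAKES = {
    "field_accuracy": 0.25, "citation_quality": 0.25, "table_fidelity": 0.15,
    "schema_stability": 0.15, "review_load": 0.15, "latency": 0.05,
}
LOW_STAKES = {
    "field_accuracy": 0.30, "citation_quality": 0.05, "table_fidelity": 0.10,
    "schema_stability": 0.05, "review_load": 0.10, "latency": 0.40,
}


def weighted_score(row: dict, weights: dict) -> float:
    """Weighted average after converting every metric to higher-is-better."""
    latency = float(row["latency"])
    normalized = {
        "field_accuracy": float(row["field_accuracy"]),
        "citation_quality": float(row["citation_quality"]),
        "table_fidelity": float(row["table_fidelity"]),
        "schema_stability": float(row["schema_stability"]),
        "review_load": 1.0 - float(row["review_load"]),  # lower load is better
        "latency": min(1.0, LATENCY_BUDGET / latency) if latency > 0 else 1.0,
    }
    total_weight = sum(weights.values())
    return sum(weights[name] * normalized[name] for name in weights) / total_weight


def rank(csv_text: str, weights: dict) -> list:
    """Return (vendor, score) pairs sorted from best to worst."""
    rows = csv.DictReader(io.StringIO(csv_text))
    scored = [(row["vendor"], round(weighted_score(row, weights), 3)) for row in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With these illustrative weights, `rank(SCORECARD_CSV, HIGH_STAKES)` puts Cogneris first and `rank(SCORECARD_CSV, LOW_STAKES)` puts Vendor A first, which is the point of weighting by risk: the same measurements can support different choices.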

Run the same workflow test for every vendor

Use identical documents, field schemas, prompts or templates, retry policy, and review thresholds. Record whether each vendor returns typed JSON, source citations, confidence scores, validation results, webhook callbacks, and recoverable error messages.
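A small harness can make the "identical inputs" rule checkable. The sketch below is one possible shape: `RunConfig`, `fingerprint`, and `capability_record` are hypothetical names, and the capability list simply mirrors the items in the paragraph above. Every vendor run stores the same configuration fingerprint, so a mismatch shows that something was changed between runs.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical settings that must be identical for every vendor.
    document_ids: tuple[str, ...]
    field_schema: tuple[str, ...]
    template_id: str
    max_retries: int
    review_threshold: float


def fingerprint(config: RunConfig) -> str:
    """Short, stable hash of the configuration used for a run."""
    payload = json.dumps(asdict(config), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


CAPABILITIES = (
    "typed_json", "source_citations", "confidence_scores",
    "validation_results", "webhook_callbacks", "recoverable_errors",
)


def capability_record(vendor: str, config: RunConfig, observed: set) -> dict:
    """One row per vendor: which capabilities were observed under which config."""
    unknown = observed - set(CAPABILITIES)
    if unknown:
        raise ValueError(f"unknown capabilities: {sorted(unknown)}")
    record = {"vendor": vendor, "config": fingerprint(config)}
    record.update({name: name in observed for name in CAPABILITIES})
    return record
```

Comparing the `config` value across records before comparing scores is a cheap guard against a benchmark that quietly used different thresholds or templates per vendor.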
