Document Extraction Benchmark: OCR & JSON

Measure the whole workflow

Accuracy matters, but it is not the only metric. Teams should test citation quality, table fidelity, schema stability, latency, confidence calibration, webhook behavior, review load, and integration effort.

Build the test set

Use 25 to 100 real documents across easy, typical, and ugly cases. Include scans, photos, multi-page packets, missing fields, handwriting, checkboxes, signatures, tables, and documents that should fail validation.

Document mix

Include invoices, statements, IDs, contracts, claims, payroll, tax, and packet uploads.

Image quality

Test native PDFs, scanned PDFs, mobile photos, shadows, rotation, blur, and crops.

Failure cases

Add missing pages, inconsistent values, duplicate files, wrong document types, and fraud signals.

Score what operators see

Track clean-through rate (documents that need no human touch), fields sent to review, reviewer time per document, false confidence (wrong values reported with high confidence), duplicate detection, retry behavior, and the cost of explaining a wrong extraction after the fact.
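As a rough illustration, three of these operator metrics can be computed from per-field results once a review threshold is fixed. The `FieldResult` record, the `operator_metrics` helper, and the 0.9 threshold are hypothetical names and values chosen for this sketch, not part of any vendor API.

```python
from dataclasses import dataclass


@dataclass
class FieldResult:
    doc_id: str
    field: str
    correct: bool      # value matches the labeled answer key
    confidence: float  # vendor-reported confidence, assumed to be in [0, 1]


def operator_metrics(results, review_threshold=0.9):
    """Summarize what operators would see at a given review threshold."""
    if not results:
        raise ValueError("no field results to score")

    # Group fields by document so clean-through can be judged per document.
    by_doc = {}
    for r in results:
        by_doc.setdefault(r.doc_id, []).append(r)

    auto_accepted = [r for r in results if r.confidence >= review_threshold]
    sent_to_review = [r for r in results if r.confidence < review_threshold]
    # False confidence: auto-accepted fields that are actually wrong.
    false_confident = [r for r in auto_accepted if not r.correct]
    # A document is "clean" when none of its fields is routed to review.
    clean_docs = [
        doc for doc, fields in by_doc.items()
        if all(f.confidence >= review_threshold for f in fields)
    ]

    return {
        "clean_through_rate": len(clean_docs) / len(by_doc),
        "review_rate": len(sent_to_review) / len(results),
        "false_confidence_rate": (
            len(false_confident) / len(auto_accepted) if auto_accepted else 0.0
        ),
    }


sample = [
    FieldResult("doc-a", "total", True, 0.95),
    FieldResult("doc-a", "invoice_number", False, 0.95),
    FieldResult("doc-b", "total", True, 0.50),
    FieldResult("doc-b", "invoice_number", True, 0.99),
]
metrics = operator_metrics(sample)
```

Note that a document can pass clean-through and still contain a wrong value, which is why false confidence is tracked separately.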

Include OCR, JSON, tables, and citations

A fair benchmark separates text recognition from structured extraction. Score raw OCR text, typed JSON fields, table and line-item fidelity, source citations, validation results, and whether confidence scores actually predict review risk.
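One way to check whether confidence predicts review risk is to bucket fields by reported confidence and compare error rates per bucket. The sketch below assumes each field is reduced to a `(confidence, is_correct)` pair. The bucket edges and the helper names `calibration_table` and `is_monotonic` are illustrative choices.

```python
def calibration_table(fields, edges=(0.0, 0.5, 0.8, 0.95, 1.0)):
    """Bucket (confidence, is_correct) pairs and report error rate per bucket."""
    buckets = [[0, 0] for _ in range(len(edges) - 1)]  # [count, errors]
    for conf, ok in fields:
        for i in range(len(edges) - 1):
            is_last = i == len(edges) - 2
            # Buckets are half-open, except the last one includes its upper edge.
            if edges[i] <= conf < edges[i + 1] or (is_last and conf == edges[-1]):
                buckets[i][0] += 1
                buckets[i][1] += 0 if ok else 1
                break
    return [
        {
            "range": (edges[i], edges[i + 1]),
            "count": count,
            "error_rate": errors / count if count else None,
        }
        for i, (count, errors) in enumerate(buckets)
    ]


def is_monotonic(rows):
    """True when error rate never rises as confidence rises (empty buckets skipped)."""
    rates = [r["error_rate"] for r in rows if r["error_rate"] is not None]
    return all(a >= b for a, b in zip(rates, rates[1:]))


fields = [(0.3, False), (0.4, True), (0.85, True), (0.9, False), (0.99, True), (1.0, True)]
rows = calibration_table(fields)
well_ordered = is_monotonic(rows)
```

With only 25 to 100 documents, individual buckets can be small, so treat a non-monotonic result as a prompt to inspect the fields rather than as a verdict.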

Benchmark dimensions

Dimension	How to test	Why it matters
Field accuracy	Compare every extracted value to a labeled answer key.	Separates impressive demos from dependable workflow data.
Citation quality	Check whether each value links to the right page, span, or box.	Reviewers need evidence, not just a number.
Table fidelity	Score rows, columns, line items, totals, merged cells, and missing rows.	AP, banking, claims, and logistics workflows depend on tables.
Schema stability	Run the same packet repeatedly and diff JSON shape and field names.	Downstream systems break when schemas drift silently.
Confidence calibration	Check whether low-confidence fields are actually more likely to be wrong.	Bad calibration creates either review overload or false automation.
Workflow behavior	Test async jobs, webhooks, retries, idempotency, and failure states.	Production volume is where API demos usually crack.
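The schema stability check can be sketched as a shape diff: reduce each JSON response to a set of (path, type) pairs and compare runs. The `shape` and `schema_drift` helpers and the sample payloads are hypothetical, and the sketch assumes responses have already been parsed into Python dicts and lists.

```python
def shape(value, path="$"):
    """Return the set of (json_path, type_name) pairs found in a parsed JSON value."""
    if isinstance(value, dict):
        pairs = {(path, "object")}
        for key, child in value.items():
            pairs |= shape(child, f"{path}.{key}")
        return pairs
    if isinstance(value, list):
        pairs = {(path, "array")}
        for child in value:
            # All array items share one path, so mixed item types show up together.
            pairs |= shape(child, f"{path}[]")
        return pairs
    return {(path, type(value).__name__)}


def schema_drift(run_a, run_b):
    """List paths or types present in one run but not the other."""
    a, b = shape(run_a), shape(run_b)
    return {"only_in_a": sorted(a - b), "only_in_b": sorted(b - a)}


run_1 = {"invoice_number": "INV-1", "total": 120.5,
         "line_items": [{"sku": "A", "qty": 2}]}
run_2 = {"invoice_number": "INV-1", "total": "120.50",
         "lineItems": [{"sku": "A", "qty": 2}]}

drift = schema_drift(run_1, run_2)
```

Here the diff would surface both a renamed key (`line_items` versus `lineItems`) and a type change on `total`. A field that is null in one run and a string in another also counts as drift under this definition, which may or may not be what a given team wants.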

Use a weighted scorecard

Weight the benchmark by business risk. A low-stakes receipt workflow can optimize for speed and price. A mortgage, KYC, claims, or AP approval workflow should give more weight to citations, validation, review routing, audit trail, and false-confidence handling. In the sample scorecard below, review_load and latency are lower-is-better, while the other score columns are higher-is-better.

vendor,document_type,field_accuracy,citation_quality,table_fidelity,schema_stability,review_load,latency,total_score
Cogneris,invoice,0.96,0.94,0.92,0.98,0.08,2.4,0.94
Vendor A,invoice,0.97,0.71,0.88,0.82,0.19,1.8,0.84
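A minimal sketch of how such a weighted total could be computed from the rows above. The weights, the 10-unit latency budget, and the `weighted_score` helper are assumptions made for illustration. They are not the formula behind the `total_score` column, and the results are not expected to match it.

```python
import csv
import io

# Hypothetical weights for a high-risk workflow; they should sum to 1.0.
WEIGHTS = {
    "field_accuracy": 0.25,
    "citation_quality": 0.25,
    "table_fidelity": 0.15,
    "schema_stability": 0.15,
    "review_load": 0.15,
    "latency": 0.05,
}
LATENCY_BUDGET = 10.0  # assumed ceiling, in the same unit as the latency column

SCORECARD = """vendor,document_type,field_accuracy,citation_quality,table_fidelity,schema_stability,review_load,latency,total_score
Cogneris,invoice,0.96,0.94,0.92,0.98,0.08,2.4,0.94
Vendor A,invoice,0.97,0.71,0.88,0.82,0.19,1.8,0.84
"""


def weighted_score(row, weights=WEIGHTS, latency_budget=LATENCY_BUDGET):
    """Combine one scorecard row into a single 0-to-1 score."""
    higher_is_better = ("field_accuracy", "citation_quality",
                        "table_fidelity", "schema_stability")
    normalized = {name: float(row[name]) for name in higher_is_better}
    # Lower-is-better metrics are inverted so every term rewards good behavior.
    normalized["review_load"] = 1.0 - float(row["review_load"])
    normalized["latency"] = max(0.0, 1.0 - float(row["latency"]) / latency_budget)
    return sum(weights[name] * normalized[name] for name in weights)


scores = {
    row["vendor"]: weighted_score(row)
    for row in csv.DictReader(io.StringIO(SCORECARD))
}
```

Changing the weights to favor latency and price, as a low-stakes receipt workflow might, can change the ranking, which is the point of weighting by business risk.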

Run the same workflow test for every vendor

Use identical documents, field schemas, prompts or templates, retry policy, and review thresholds. Record whether each vendor returns typed JSON, source citations, confidence scores, validation results, webhook callbacks, and recoverable error messages.


Benchmark document extraction like production.