Measure the whole workflow
Accuracy matters, but it is not the only metric. Teams should test citation quality, table fidelity, schema stability, latency, confidence calibration, webhook behavior, review load, and integration effort.
Build the test set
Use 25 to 100 real documents across easy, typical, and ugly cases. Include scans, photos, multi-page packets, missing fields, handwriting, checkboxes, signatures, tables, and documents that should fail validation.
Document mix
Include invoices, statements, IDs, contracts, claims, payroll records, tax forms, and multi-page packet uploads.
Image quality
Test native PDFs, scanned PDFs, mobile photos, shadows, rotation, blur, and crops.
Failure cases
Add missing pages, inconsistent values, duplicate files, wrong document types, and fraud signals.
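One way to keep the test set honest is to describe it in a small manifest and check coverage before any vendor sees it. The sketch below is illustrative only: the `BenchmarkDoc` fields, the tier names, and the `coverage_gaps` helper are assumptions made for this example, not part of any vendor tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkDoc:
    path: str          # hypothetical file path
    doc_type: str      # e.g. "invoice", "statement", "id"
    difficulty: str    # "easy", "typical", or "ugly"
    capture: str       # e.g. "native_pdf", "scanned_pdf", "mobile_photo"
    should_fail: bool  # True if validation is expected to reject the document


REQUIRED_DIFFICULTIES = {"easy", "typical", "ugly"}


def coverage_gaps(docs, min_docs=25, max_docs=100):
    """Return a list of human-readable problems with the test set."""
    gaps = []
    if not min_docs <= len(docs) <= max_docs:
        gaps.append(f"expected {min_docs}-{max_docs} documents, got {len(docs)}")
    missing = REQUIRED_DIFFICULTIES - {d.difficulty for d in docs}
    if missing:
        gaps.append(f"missing difficulty tiers: {sorted(missing)}")
    if not any(d.should_fail for d in docs):
        gaps.append("no documents that should fail validation")
    return gaps
```

An empty list from `coverage_gaps` means the manifest meets the size range, covers all three difficulty tiers, and includes at least one document that should fail validation. Checks for document mix and image quality could be added the same way.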
Score what operators see
Track clean-through rate, fields sent to review, reviewer time, false confidence, duplicate detection, retry behavior, and the cost of explaining a wrong extraction after the fact.
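A rough sketch of how some of these operator-facing numbers might be computed from per-document results. It assumes "clean-through" means a document with no fields routed to review. That definition, the `DocResult` record, and the metric names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class DocResult:
    fields_total: int        # fields the schema expects for this document
    fields_in_review: int    # fields routed to a human reviewer
    reviewer_seconds: float  # time a reviewer spent on this document


def operator_metrics(results):
    """Summarize review burden across a list of DocResult records."""
    docs = len(results)
    if docs == 0:
        raise ValueError("no results to score")
    clean = sum(1 for r in results if r.fields_in_review == 0)
    total_fields = sum(r.fields_total for r in results)
    review_fields = sum(r.fields_in_review for r in results)
    return {
        # Share of documents that needed no human review at all.
        "clean_through_rate": clean / docs,
        # Share of all fields that were sent to review.
        "review_field_rate": review_fields / total_fields if total_fields else 0.0,
        # Mean reviewer time per document, including clean documents.
        "avg_reviewer_seconds": sum(r.reviewer_seconds for r in results) / docs,
    }
```

False confidence, duplicate detection, and retry behavior need labeled outcomes or harness logs, so they are left out of this sketch.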
Include OCR, JSON, tables, and citations
A fair benchmark separates text recognition from structured extraction. Score raw OCR text, typed JSON fields, table and line-item fidelity, source citations, validation results, and whether confidence scores actually predict review risk.
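As a minimal illustration of scoring the structured layer separately from raw OCR text, the following compares extracted JSON fields to a labeled answer key. The normalization rule (collapse whitespace, ignore case) and the function names are assumptions. A real benchmark would likely need type-aware comparison for dates and amounts.

```python
def normalize(value):
    """Light normalization so formatting noise does not count as an error."""
    if value is None:
        return None
    return " ".join(str(value).split()).casefold()


def field_accuracy(extracted, answer_key):
    """Share of answer-key fields whose extracted value matches after normalization.

    A field labeled None in the answer key counts as correct only when the
    extraction also omits it or returns None.
    """
    if not answer_key:
        raise ValueError("answer key is empty")
    correct = sum(
        1
        for name, expected in answer_key.items()
        if normalize(extracted.get(name)) == normalize(expected)
    )
    return correct / len(answer_key)
```

Extra fields in `extracted` that are not in the answer key are ignored here. Whether to penalize them is a design choice worth fixing before comparing vendors.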
Benchmark dimensions
| Dimension | How to test | Why it matters |
|---|---|---|
| Field accuracy | Compare every extracted value to a labeled answer key. | Separates impressive demos from dependable workflow data. |
| Citation quality | Check whether each value links to the right page, span, or box. | Reviewers need evidence, not just a number. |
| Table fidelity | Score rows, columns, line items, totals, merged cells, and missing rows. | AP, banking, claims, and logistics workflows depend on tables. |
| Schema stability | Run the same packet repeatedly and diff JSON shape and field names. | Downstream systems break when schemas drift silently. |
| Confidence calibration | Check whether low-confidence fields are actually more likely to be wrong. | Bad calibration creates either review overload or false automation. |
| Workflow behavior | Test async jobs, webhooks, retries, idempotency, and failure states. | Production volume is where API demos usually crack. |
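The confidence calibration row can be checked with a simple binned error table. This is a sketch under assumptions: the bin edges are arbitrary, and `predictions` is assumed to be a list of `(confidence, is_correct)` pairs built from the labeled answer key.

```python
def calibration_table(predictions, bins=(0.0, 0.5, 0.8, 0.95, 1.0)):
    """Group fields by confidence and report the observed error rate per bin.

    predictions: list of (confidence, is_correct) pairs.
    Returns rows of (low, high, count, error_rate); empty bins are skipped.
    """
    rows = []
    last_index = len(bins) - 2
    for i, (low, high) in enumerate(zip(bins, bins[1:])):
        # Bins are half-open [low, high), except the last one, which includes high.
        in_bin = [
            ok
            for conf, ok in predictions
            if low <= conf < high or (i == last_index and conf == high)
        ]
        if in_bin:
            error_rate = 1 - sum(in_bin) / len(in_bin)
            rows.append((low, high, len(in_bin), error_rate))
    return rows
```

If confidence is well calibrated, the error rate should fall as the bins move toward 1.0. A high-confidence bin with a noticeable error rate is the false-automation case the table warns about.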
Use a weighted scorecard
Weight the benchmark by business risk. A low-stakes receipt workflow can optimize for speed and price. A mortgage, KYC, claims, or AP approval workflow should give more weight to citations, validation, review routing, audit trail, and false-confidence handling.
vendor,document_type,field_accuracy,citation_quality,table_fidelity,schema_stability,review_load,latency,total_score
Cogneris,invoice,0.96,0.94,0.92,0.98,0.08,2.4,0.94
Vendor A,invoice,0.97,0.71,0.88,0.82,0.19,1.8,0.84
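A sketch of how a risk-weighted total could be computed from scorecard metrics like these. The weights are purely illustrative for a high-risk workflow and are not claimed to be the ones behind the `total_score` column. Latency is left out because a value in seconds would first need to be normalized against a latency budget.

```python
# Metrics where a smaller number is better get flipped before weighting.
LOWER_IS_BETTER = {"review_load"}

# Hypothetical weights for a high-risk workflow (mortgage, KYC, claims, AP).
HIGH_RISK_WEIGHTS = {
    "field_accuracy": 0.3,
    "citation_quality": 0.3,
    "table_fidelity": 0.1,
    "schema_stability": 0.1,
    "review_load": 0.2,
}


def weighted_score(metrics, weights):
    """Weighted average of 0-1 metrics, normalized by the sum of the weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    score = 0.0
    for name, weight in weights.items():
        value = metrics[name]
        if name in LOWER_IS_BETTER:
            value = 1.0 - value
        score += weight * value
    return score / total_weight


# Metric values copied from the example scorecard rows.
cogneris = {"field_accuracy": 0.96, "citation_quality": 0.94, "table_fidelity": 0.92,
            "schema_stability": 0.98, "review_load": 0.08}
vendor_a = {"field_accuracy": 0.97, "citation_quality": 0.71, "table_fidelity": 0.88,
            "schema_stability": 0.82, "review_load": 0.19}

for name, row in [("Cogneris", cogneris), ("Vendor A", vendor_a)]:
    print(name, round(weighted_score(row, HIGH_RISK_WEIGHTS), 3))
```

A low-stakes receipt workflow would use a different weight table, for example shifting weight away from citations and toward a normalized latency or price metric.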
Run the same workflow test for every vendor
Use identical documents, field schemas, prompts or templates, retry policy, and review thresholds. Record whether each vendor returns typed JSON, source citations, confidence scores, validation results, webhook callbacks, and recoverable error messages.
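To record these capabilities consistently, a harness could reduce each vendor response to the same checklist. The response shape below (a `fields` mapping whose entries carry `value`, `confidence`, and `citation`, plus a top-level `validation` key) is a hypothetical normalized format assumed for this sketch. Each vendor's actual output would need an adapter to reach it.

```python
def _all_fields_have(fields, key):
    """True when there is at least one field and every field dict has `key`."""
    return bool(fields) and all(
        isinstance(f, dict) and key in f for f in fields.values()
    )


def capability_record(vendor, response, webhook_received):
    """Summarize which capabilities a (normalized, hypothetical) response shows.

    webhook_received is observed by the test harness, not read from the response.
    """
    fields = response.get("fields") or {}
    return {
        "vendor": vendor,
        "typed_json": _all_fields_have(fields, "value"),
        "source_citations": _all_fields_have(fields, "citation"),
        "confidence_scores": _all_fields_have(fields, "confidence"),
        "validation_results": "validation" in response,
        "webhook_callback": bool(webhook_received),
    }
```

Running `capability_record` on the same document for every vendor yields rows that can be compared side by side. Error-message quality is harder to reduce to a boolean and is probably best recorded as reviewer notes.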