
OCR. What it is, what it isn't.

OCR · Optical Character Recognition · Last updated 2026-05-07

Definition

OCR (Optical Character Recognition) is the conversion of pixel data (scanned images, photos of documents, screenshots) into machine-readable text. It is the foundational layer underneath every intelligent document processing system, and increasingly a commodity. Modern OCR engines reach 99%+ character accuracy on clean documents, and even mediocre OCR is often good enough when the downstream model is an LLM that can correct a few misread characters from context.
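To make "character accuracy" concrete, here is a minimal sketch of one common way to measure it: one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The function names and the sample strings are illustrative assumptions, not taken from any particular OCR product.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # reference character missing from the output
                current[j - 1] + 1,      # extra character in the output
                previous[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """1 - (edit distance / reference length), clamped at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


# Hypothetical sample: "l" misread as "1", and "0" misread as "O".
ground_truth = "Invoice total: 1,204.50 EUR"
ocr_output = "Invoice tota1: 1,204.5O EUR"
accuracy = character_accuracy(ground_truth, ocr_output)
print(f"character accuracy: {accuracy:.3f}")
```

Note that both errors in this sample land in places where they matter: one in a label, one in an amount. Character accuracy alone does not say where the mistakes fall.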

Where OCR fits in a document AI pipeline

OCR is one stage in a longer pipeline. The full pipeline looks like this:

  1. Ingestion: accept the file, normalize the format.
  2. OCR: convert pixels to text. (This is the part most people mean when they say "document OCR".)
  3. Layout analysis: find regions, tables, signatures.
  4. Classification: identify the document type.
  5. Extraction: pull the specific fields that matter for that type.
  6. Validation: run cross-field and business-rule checks.
  7. Integration: push to ERP, CRM, or warehouse.

Buying a product that markets itself as "OCR" gets you stage 2. Buying a product that markets itself as IDP or Document AI gets you stages 1 through 7.
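The difference between buying one stage and buying the whole pipeline can be sketched as a list of stage functions applied in order. Everything below is a hypothetical illustration: the `Document` type, the stage functions, and their toy logic are assumptions, and the `ocr` stub simply decodes bytes where a real engine would read pixels.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    raw_bytes: bytes
    text: str = ""
    doc_type: str = ""
    fields: dict[str, str] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)
    completed_stages: list[str] = field(default_factory=list)


Stage = Callable[[Document], Document]


def ingest(doc: Document) -> Document:
    return doc  # placeholder: format normalization would happen here


def ocr(doc: Document) -> Document:
    # Stand-in for a real OCR engine: pretend the bytes are already text.
    doc.text = doc.raw_bytes.decode("utf-8", errors="replace")
    return doc


def analyze_layout(doc: Document) -> Document:
    return doc  # placeholder: region, table, and signature detection


def classify(doc: Document) -> Document:
    doc.doc_type = "invoice" if "invoice" in doc.text.lower() else "unknown"
    return doc


def extract(doc: Document) -> Document:
    # Toy extractor: treat every "key: value" line of an invoice as a field.
    if doc.doc_type == "invoice":
        for line in doc.text.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                doc.fields[key.strip().lower()] = value.strip()
    return doc


def validate(doc: Document) -> Document:
    if doc.doc_type == "invoice" and "total" not in doc.fields:
        doc.errors.append("missing total")
    return doc


def integrate(doc: Document) -> Document:
    return doc  # placeholder: push to ERP, CRM, or warehouse


PIPELINE: tuple[tuple[str, Stage], ...] = (
    ("ingestion", ingest),
    ("ocr", ocr),
    ("layout analysis", analyze_layout),
    ("classification", classify),
    ("extraction", extract),
    ("validation", validate),
    ("integration", integrate),
)


def run_pipeline(doc: Document, stages=PIPELINE) -> Document:
    for name, stage in stages:
        doc = stage(doc)
        doc.completed_stages.append(name)
    return doc


sample = b"Invoice\nTotal: 42.00"
ocr_only = run_pipeline(Document(sample), stages=PIPELINE[1:2])  # stage 2 alone
full = run_pipeline(Document(sample))                            # stages 1 through 7
print(ocr_only.fields, full.fields)
```

Running only the OCR stage leaves `fields` empty: you have text, but nothing has decided what the document is or which values matter.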

Common pitfalls

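Obsessing over OCR accuracy. 99% character accuracy looks great until you look at whole documents. Even if every field is read correctly 99% of the time, a 12-field document still has roughly a 1-in-9 chance (1 - 0.99^12 ≈ 11%) of containing at least one incorrect field, and because each field spans many characters, 99% character accuracy leaves the real odds worse than that. The metric that matters is document-level auto-approval rate, not character accuracy.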
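The 1-in-9 figure can be reproduced with a few lines of arithmetic, assuming errors are independent and treating 99% as the accuracy per field. The second calculation applies the same 99% per character instead, with an assumed average of 10 characters per field (an illustrative number, not one from this entry).

```python
def p_document_has_error(unit_accuracy: float, units: int) -> float:
    """Chance that at least one of `units` independent reads is wrong."""
    return 1.0 - unit_accuracy ** units


FIELDS_PER_DOCUMENT = 12
ASSUMED_CHARS_PER_FIELD = 10  # assumption for illustration only

# 99% accuracy per field: about 11%, or roughly 1 in 9.
field_level = p_document_has_error(0.99, FIELDS_PER_DOCUMENT)

# 99% accuracy per character across 120 characters: about 70%.
char_level = p_document_has_error(0.99, FIELDS_PER_DOCUMENT * ASSUMED_CHARS_PER_FIELD)

print(f"per-field 99%:     {field_level:.1%} of documents contain an error")
print(f"per-character 99%: {char_level:.1%} of documents contain an error")
```

Real errors are not independent (a blurry scan hurts every field at once), so treat these as rough orders of magnitude rather than predictions.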

Confusing OCR vendors with IDP vendors. Tesseract, AWS Textract, Google Cloud Vision, and Azure Computer Vision are OCR products. They give you text. You still need to build the classifier, the extractor, the validator, and the audit trail — which is most of the work.

Skipping OCR entirely. Some Document AI systems pass document images directly to multimodal LLMs, bypassing the explicit OCR step. This works well on document types the model has seen in training, but breaks down on unusual layouts where a dedicated OCR pass adds robustness. The right answer is usually both — OCR as a fallback, multimodal as the primary path.
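One way to combine the two paths is a confidence-gated router: try the multimodal reader first and run the OCR pass only when the first answer looks shaky. This is a minimal sketch under stated assumptions. `ReadResult`, `read_with_fallback`, the 0.85 threshold, and both stub readers are hypothetical, and it assumes each reader can report a comparable confidence score, which real systems may not provide out of the box.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReadResult:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the reader that produced it
    source: str


Reader = Callable[[bytes], ReadResult]


def read_with_fallback(
    image: bytes,
    primary: Reader,
    fallback: Reader,
    min_confidence: float = 0.85,  # assumed threshold; tune per document type
) -> ReadResult:
    """Use the primary reader; consult the fallback only on low confidence."""
    first = primary(image)
    if first.confidence >= min_confidence:
        return first
    second = fallback(image)
    # Keep whichever reader is more confident about this document.
    return second if second.confidence > first.confidence else first


# Stub readers standing in for a multimodal LLM call and an OCR engine.
def stub_multimodal(image: bytes) -> ReadResult:
    return ReadResult(text="Tota1 42.00", confidence=0.60, source="multimodal")


def stub_ocr(image: bytes) -> ReadResult:
    return ReadResult(text="Total 42.00", confidence=0.93, source="ocr")


result = read_with_fallback(b"fake image bytes", primary=stub_multimodal, fallback=stub_ocr)
print(result.source, result.text)
```

Swapping the `primary` and `fallback` arguments gives the opposite arrangement, which is why the direction of the fallback is worth asking a vendor about.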

What to ask a vendor

  • Which OCR engine is in your pipeline, and is it swappable per tenant?
  • How do you handle low-quality inputs — faded thermal print, smartphone photos with glare, multi-language documents?
  • Do you fall back to OCR when the multimodal LLM is uncertain, or vice versa?
  • What's the OCR latency budget in your end-to-end SLA?
