Extraction API. PDF in, JSON out.
Post a document, receive a structured payload that matches a template or an inline schema — with per-field confidence and a full audit trail for every value the model emitted.
Overview
A single POST returns a typed JSON payload for the fields you asked for.
- Template-based — extract against one of 40+ shipped templates (invoice, contract, KYC, payroll, claim).
- Schema-on-the-fly — pass an inline JSON schema and the model fills it.
- Per-field confidence — every value carries a confidence score and a bounding-box reference.
- Two modes — synchronous for files up to 25 pages, asynchronous with a webhook callback for anything larger.
Related product pages: document extraction API, document parsing API, and OCR API.
Endpoints
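Two routes, same response shape.
- POST /v1/extract — synchronous; the payload comes back in the HTTP response.
- POST /v1/extract/async — asynchronous; returns a job_id (HTTP 202) and delivers the result to your webhook.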
Request
Multipart upload, or a signed URL the platform pulls from.
# curl — invoice template, multipart
curl -X POST https://api.cogneris.ai/v1/extract \
-H "Authorization: Bearer $COGNERIS_KEY" \
-F "template=invoice" \
-F "file=@./acme-2026-04.pdf"
{
  "source": { "url": "https://files.example.com/po-9912.pdf" },
  "schema": {
    "vendor_name": "string",
    "po_number": "string",
    "line_items": "array<{description:string, qty:number, unit_price:number}>",
    "total_amount": "number"
  },
  "options": { "include_bounding_boxes": true }
}
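The signed-URL request can be sketched in Python as below. This is a minimal sketch, not an official client: it assumes the JSON body is posted to the same /v1/extract route as the curl example and that the API accepts a Content-Type of application/json, neither of which the page states outright. build_extract_request is a hypothetical helper name. The request is built but not sent, so the sketch runs offline.

```python
import json
import urllib.request

# Sync route taken from the curl example above.
API_URL = "https://api.cogneris.ai/v1/extract"


def build_extract_request(api_key: str, source_url: str, schema: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON extraction request for a signed URL."""
    body = {
        "source": {"url": source_url},
        "schema": schema,
        "options": {"include_bounding_boxes": True},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: JSON bodies are sent as application/json.
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_extract_request(
    "test-key",
    "https://files.example.com/po-9912.pdf",
    {"vendor_name": "string", "total_amount": "number"},
)
# urllib.request.urlopen(req) would send it; omitted so nothing leaves the machine.
```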
Parameters
- template — slug of a shipped or tenant-defined template. Mutually exclusive with schema.
- schema — inline field map. The model treats it as the contract for the response.
- source — either file (multipart) or { url } (signed HTTPS URL, fetched server-side).
- options.include_bounding_boxes — adds a bbox per field, page-indexed.
- options.locale — hint for date and number parsing (defaults to tenant locale).
{
  "schema_version": "vendor-onboarding-v3",
  "schema": {
    "vendor": {
      "legal_name": "string|required",
      "tax_id": "string|required",
      "addresses": "array<{type:string, line1:string, city:string, country:string}>"
    },
    "documents": "array<{type:string, issue_date:date, expiry_date:date}>",
    "line_items": "array<{description:string, quantity:number, amount:currency}>"
  },
  "options": {
    "include_citations": true,
    "include_confidence": true,
    "validation_profile": "vendor_onboarding"
  }
}
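The parameter rules can be checked client-side before a request is sent. The sketch below is illustrative only: check_request is a hypothetical helper, and the rule that at least one of template or schema must be present is an assumption (the page only says the two are mutually exclusive).

```python
def check_request(body: dict) -> list:
    """Return a list of problems found in an extraction request body."""
    problems = []
    has_template = "template" in body
    has_schema = "schema" in body
    if has_template and has_schema:
        problems.append("template and schema are mutually exclusive")
    if not (has_template or has_schema):
        # Assumption: the API needs one of the two to know what to extract.
        problems.append("provide either template or schema")
    url = body.get("source", {}).get("url")
    if url is not None and not url.startswith("https://"):
        # The parameter list describes source.url as a signed HTTPS URL.
        problems.append("source.url must be an HTTPS URL")
    return problems


ok_body = {
    "schema": {"vendor_name": "string"},
    "source": {"url": "https://files.example.com/po-9912.pdf"},
}
bad_body = {"template": "invoice", "schema": {"vendor_name": "string"}}
```

Here check_request(ok_body) returns an empty list, while check_request(bad_body) reports the template/schema conflict.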
Response
A wrapper with data, meta and has_errors, matching the platform contract.
{
  "data": {
    "vendor_name": { "value": "Acme Industries Ltd.", "confidence": 0.98, "bbox": [82, 114, 412, 136] },
    "po_number": { "value": "PO-9912", "confidence": 1.00 },
    "line_items": [ /* … */ ],
    "total_amount": { "value": 12480.50, "confidence": 0.99 }
  },
  "meta": {
    "job_id": "ext_01J9MR3K…",
    "template": "invoice",
    "model": "flx-extract-2026-04",
    "pages": 3,
    "latency_ms": 2410,
    "audit_url": "https://app.cogneris.ai/audit/ext_01J9MR3K"
  },
  "has_errors": false
}
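Downstream code often wants the bare values without the confidence and bbox wrappers. The sketch below shows one way to unwrap the data object; plain_values is a hypothetical helper, and it assumes every extracted field is a dict carrying a value key, as in the example response. Sibling keys such as currency are dropped by this unwrapping.

```python
def plain_values(node):
    """Recursively replace {value, confidence, ...} wrappers with the bare value."""
    if isinstance(node, dict):
        if "value" in node:
            return node["value"]
        return {key: plain_values(child) for key, child in node.items()}
    if isinstance(node, list):
        return [plain_values(item) for item in node]
    return node


response = {
    "data": {
        "vendor_name": {"value": "Acme Industries Ltd.", "confidence": 0.98,
                        "bbox": [82, 114, 412, 136]},
        "po_number": {"value": "PO-9912", "confidence": 1.00},
        "total_amount": {"value": 12480.50, "confidence": 0.99},
    },
    "has_errors": False,
}

# Only unwrap payloads the platform did not flag as erroneous.
record = plain_values(response["data"]) if not response["has_errors"] else None
```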
Schema design
Treat schemas as API contracts, not prompt hints.
For production extraction, define fields with names, types, required status, allowed enums, normalization rules, and validation rules. Use arrays for line items and transactions; use nested objects for parties, addresses, dates, and evidence groups. Version every schema that feeds downstream systems.
- Type values explicitly — string, number, date, currency, boolean, object, or array.
- Separate extraction from validation — extract the value first, then apply business rules.
- Keep optional fields visible — return null with evidence status instead of silently dropping fields.
- Use schema versions — include the version in the request and audit log when fields change.
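Separating extraction from validation can look like the sketch below: the extracted payload is left untouched, and business rules run over it afterwards. Everything here is hypothetical scaffolding (the required, positive, and validate helpers are not part of the API); it assumes fields arrive as {value, confidence} objects and that a missing optional value is returned as null rather than dropped.

```python
from typing import Callable, Optional

# A rule inspects the extracted data and returns an error message or None.
Rule = Callable[[dict], Optional[str]]


def required(name: str) -> Rule:
    def rule(data: dict) -> Optional[str]:
        field = data.get(name)
        if field is None or field.get("value") is None:
            return f"{name}: required value is missing"
        return None
    return rule


def positive(name: str) -> Rule:
    def rule(data: dict) -> Optional[str]:
        value = (data.get(name) or {}).get("value")
        if value is not None and value <= 0:
            return f"{name}: must be positive"
        return None
    return rule


def validate(data: dict, rules: list) -> list:
    """Run every rule and collect the messages of those that fail."""
    return [msg for rule in rules if (msg := rule(data)) is not None]


extracted = {
    "vendor_name": {"value": None, "confidence": 0.0},  # null kept visible
    "total_amount": {"value": 12480.50, "confidence": 0.99},
}
errors = validate(
    extracted,
    [required("vendor_name"), required("total_amount"), positive("total_amount")],
)
```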
Line items and nested fields
Use arrays for repeatable data such as invoice lines, bank transactions, payroll earnings, policy coverages, and shipment charges. Keep nested objects for addresses, parties, identities, dates, and evidence groups so downstream systems can map fields without reparsing strings.
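Because each line item is a nested object rather than a string, downstream code can compute on it directly. The sketch below assumes the item shape used in this section's example (quantity and unit_price as {value, confidence} objects); line_total and lowest_confidence are hypothetical helpers. Decimal is used so currency arithmetic avoids binary floating-point rounding.

```python
from decimal import Decimal


def line_total(item: dict) -> Decimal:
    """Multiply quantity by unit price using exact decimal arithmetic."""
    qty = Decimal(str(item["quantity"]["value"]))
    price = Decimal(str(item["unit_price"]["value"]))
    return qty * price


def lowest_confidence(item: dict) -> float:
    """Return the weakest confidence among the item's extracted fields."""
    return min(
        field["confidence"]
        for field in item.values()
        if isinstance(field, dict) and "confidence" in field
    )


item = {
    "description": {"value": "Annual platform subscription", "confidence": 0.97},
    "quantity": {"value": 12, "confidence": 0.94},
    "unit_price": {"value": 249.00, "currency": "USD", "confidence": 0.96},
    "citation": {"page": 2, "bbox": [64, 312, 711, 336]},
}
total = line_total(item)
weakest = lowest_confidence(item)
```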
"line_items": [ { "description": { "value": "Annual platform subscription", "confidence": 0.97 }, "quantity": { "value": 12, "confidence": 0.94 }, "unit_price": { "value": 249.00, "currency": "USD", "confidence": 0.96 }, "citation": { "page": 2, "bbox": [64, 312, 711, 336] } } ]
Citations and evidence
Every important field should point back to the document.
Cogneris can return page numbers, source spans, bounding boxes, confidence, and reviewer state for extracted values. Citations are returned with the field, not as a separate viewer-only artifact, so downstream systems can store and display the evidence alongside the structured data.
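Storing the evidence alongside the value can be as simple as flattening each field into one row. The sketch below is an assumption-laden illustration: evidence_row is a hypothetical helper, and it looks for bbox both next to the citation and inside it, since the examples on this page show both placements.

```python
def evidence_row(name: str, field: dict) -> dict:
    """Flatten one extracted field into a storable row with its evidence."""
    citation = field.get("citation") or {}
    return {
        "field": name,
        "value": field.get("value"),
        "confidence": field.get("confidence"),
        "page": citation.get("page"),
        "text": citation.get("text"),
        # bbox may sit beside the citation or inside it.
        "bbox": field.get("bbox") or citation.get("bbox"),
    }


row = evidence_row(
    "total_amount",
    {
        "value": 12480.50,
        "confidence": 0.99,
        "citation": {"page": 3, "text": "Total due 12,480.50"},
        "bbox": [412, 701, 536, 724],
    },
)
```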
"total_amount": { "value": 12480.50, "confidence": 0.99, "citation": { "page": 3, "text": "Total due 12,480.50" }, "bbox": [412, 701, 536, 724] }
Async & webhooks
For files over 25 pages or batches, fire and forget.
POST /v1/extract/async returns a job_id immediately (HTTP 202). When the job finishes, we call your registered webhook with an event envelope whose data field has the same shape as the sync response.
Failed deliveries are retried with exponential backoff for up to 24 hours. Each delivery is signed with HMAC-SHA256 over the raw body.
{
  "event": "extraction.completed",
  "job_id": "ext_01J9MR3K",
  "status": "completed",
  "document": { "id": "doc_01J9MR2X", "pages": 47 },
  "review": { "required": true, "reason": "low_confidence_required_field" },
  "data": { "...same shape as sync response": true }
}
SLA — 95th-percentile job completion is under 90 seconds for documents up to 200 pages. See Webhooks for delivery semantics.
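A receiver should verify the HMAC-SHA256 signature against the raw request bytes before parsing the JSON. The sketch below uses only the Python standard library. The page does not say how the signature is transmitted, so the hex encoding and the idea of a shared per-webhook secret are assumptions, and verify_webhook is a hypothetical helper.

```python
import hashlib
import hmac


def verify_webhook(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """Check an HMAC-SHA256 signature computed over the raw request body."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected, signature_hex)


# Simulated delivery: in production the signature would arrive with the request.
secret = b"whsec_example"
body = b'{"event": "extraction.completed", "job_id": "ext_01J9MR3K"}'
good_signature = hmac.new(secret, body, hashlib.sha256).hexdigest()

accepted = verify_webhook(secret, body, good_signature)
tampered = verify_webhook(secret, body + b" ", good_signature)
```

Verifying against the raw bytes matters: re-serialising parsed JSON can change whitespace or key order and invalidate an otherwise genuine signature.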
Confidence & human review
Confidence is a scalar, not a verdict — calibrate it to your risk tolerance.
- ≥ 0.95 — auto-accept band; the value is grounded to a bounding box and matches template hints.
- 0.80 to < 0.95 — review band; route to a human if the field is financial or legal.
- < 0.80 — flagged; the agent could not ground the value and reports that.
Every score is reproducible: the audit_url in meta opens the agent trace —
prompts, tool calls, page snippets and the final reasoning that produced each value.
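The three bands translate into a small routing function. This is a sketch of one possible policy, not the platform's behaviour: triage is a hypothetical helper, and accepting non-financial, non-legal fields in the review band without a human is an assumption you should adjust to your own risk tolerance.

```python
def triage(confidence: float, high_stakes: bool) -> str:
    """Map a field's confidence score to a handling decision."""
    if confidence >= 0.95:
        return "auto_accept"
    if confidence >= 0.80:
        # Review band: the page routes financial or legal fields to a human.
        # Assumption: other fields in this band are accepted as-is.
        return "human_review" if high_stakes else "auto_accept"
    return "flagged"


decisions = {
    "po_number": triage(1.00, high_stakes=False),
    "total_amount": triage(0.91, high_stakes=True),
    "notes": triage(0.91, high_stakes=False),
    "tax_id": triage(0.62, high_stakes=True),
}
```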
Limits & pricing
- File size — 100 MB per request.
- Pages — 25 sync, 500 async, 10,000 per batch call.
- Rate — 600 requests/min per API key, burst to 1,200 for 30 seconds.
- Pricing — pay-per-page; see pricing for the tier table.
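The page and size limits can drive route selection on the client. The sketch below is hypothetical: choose_route is not part of the API, "100 MB" is read as 100,000,000 bytes (an assumption), and documents over 500 pages simply raise, because the page does not describe how a batch call is made.

```python
MAX_FILE_BYTES = 100_000_000  # assumption: 100 MB read as decimal megabytes
SYNC_MAX_PAGES = 25
ASYNC_MAX_PAGES = 500


def choose_route(pages: int, size_bytes: int) -> str:
    """Pick the extraction route that fits a document's page count and size."""
    if size_bytes > MAX_FILE_BYTES:
        raise ValueError("file exceeds the 100 MB per-request limit")
    if pages <= SYNC_MAX_PAGES:
        return "/v1/extract"
    if pages <= ASYNC_MAX_PAGES:
        return "/v1/extract/async"
    raise ValueError("over 500 pages: split the document or use a batch call")


small = choose_route(pages=3, size_bytes=400_000)
large = choose_route(pages=47, size_bytes=12_000_000)
```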
Errors
Full error catalogue in Error handling.