Most document extraction systems work on simple cases.
They can extract a few fields from a clean document.
They can handle a known format.
They can produce reasonable outputs in early testing.
This creates the impression that the problem is largely solved.
It is not.
The Real Problem Emerges at Scale
The difficulty in document extraction does not come from single documents.
It comes from real workflows:
- Many documents contributing to one outcome
- Large schemas with dozens or hundreds of fields
- Variations in how the same information is expressed
- Data that must be normalized and reconciled across sources
This is what shows up in practice:
- Explanation of Benefits (EOB) documents with tables split across pages
- Insurance loss runs with inconsistent formats
- Brokerage statements where field values depend on context across multiple sections
- Bank statements with subtle structural differences
- Oil and gas invoices with deeply nested line items
- Patient notes that span multiple visits and must be interpreted together
At this point, extraction is no longer a point problem.
It becomes a systems problem.
Most existing approaches were not designed for this.
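To make the normalization and reconciliation point concrete, here is a minimal sketch in Python. It assumes the same date has already been extracted from three documents in three different formats. The function names, the list of date formats, and the file names are all hypothetical and chosen for illustration only.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Optional

# Hypothetical set of formats; a real workflow would likely need more.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")


def normalize_date(raw: str) -> Optional[str]:
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def reconcile(values_by_source: Dict[str, str]) -> dict:
    """Normalize each source's value, then report agreement or conflict."""
    normalized = {src: normalize_date(raw) for src, raw in values_by_source.items()}
    counts = Counter(v for v in normalized.values() if v is not None)
    if not counts:
        return {"value": None, "status": "missing", "sources": normalized}
    value, _ = counts.most_common(1)[0]
    status = "agreed" if len(counts) == 1 else "conflict"
    return {"value": value, "status": status, "sources": normalized}


# The same loss date, written three different ways in three documents.
result = reconcile({
    "claim_form.pdf": "03/14/2024",
    "eob_page_2.pdf": "March 14, 2024",
    "loss_run.pdf": "2024-03-14",
})
print(result["value"], result["status"])  # 2024-03-14 agreed
```

The sketch keeps the per-source values alongside the reconciled one, so a conflict can be reviewed rather than silently resolved.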
Why Existing Systems Break
Traditional extraction systems are built on a set of assumptions:
- One document at a time
- Fixed schemas
- Limited structure
- Minimal cross-document reasoning
These assumptions hold for simple use cases.
They break under real-world complexity.
When pushed beyond these limits, systems exhibit consistent failure modes:
- Schemas must be simplified or constrained
- Fields are missed or returned as null
- Tables degrade when they span pages
- Context is lost across sections of a document
- Maintenance grows with every new document variation
These are not edge cases. They are the norm in production.
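One of these failure modes, tables that degrade when they span pages, can be illustrated with a small sketch. It assumes an upstream step has already produced one table fragment per page, and that a fragment without a header row continues the table from the previous page. The `TableFragment` type and `stitch_fragments` function are hypothetical, not part of any particular product.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TableFragment:
    page: int
    header: Optional[List[str]]  # None when the page does not repeat the header row
    rows: List[List[str]]


def stitch_fragments(fragments: List[TableFragment]) -> List[Dict[str, str]]:
    """Merge page-level fragments into one list of records.

    Assumes a fragment without a header continues the table from the
    previous page.
    """
    records: List[Dict[str, str]] = []
    current_header: Optional[List[str]] = None
    for frag in sorted(fragments, key=lambda f: f.page):
        if frag.header is not None:
            current_header = frag.header
        if current_header is None:
            raise ValueError(f"page {frag.page}: rows found before any header")
        for row in frag.rows:
            if len(row) != len(current_header):
                # Surface the problem instead of silently dropping or padding cells.
                raise ValueError(
                    f"page {frag.page}: row width {len(row)} does not match "
                    f"header width {len(current_header)}"
                )
            records.append(dict(zip(current_header, row)))
    return records


fragments = [
    TableFragment(page=2, header=None, rows=[["Lab work", "80.00", "64.00"]]),
    TableFragment(
        page=1,
        header=["Service", "Billed", "Paid"],
        rows=[["Office visit", "150.00", "120.00"], ["X-ray", "200.00", "160.00"]],
    ),
]
records = stitch_fragments(fragments)
print(len(records), records[-1]["Service"])  # 3 Lab work
```

Raising on a width mismatch is a deliberate choice in this sketch: a loud error is easier to catch than a field that quietly comes back as null.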
This Is an Architectural Constraint
Improving model accuracy alone does not resolve these issues.
The limitation is architectural.
A system designed for single-document extraction cannot be extended indefinitely to handle multi-document reasoning, large schemas, and structural variability.
A different approach is required.
A System Designed for Real Workflows
In Agentic Document Extraction (ADE), we focus on the full problem: not just extracting fields from individual documents, but producing structured, consistent outputs from complex document workflows.
This requires several capabilities:
- Infinite schema support: Schemas are not constrained by length or depth.
- Exhaustive extraction: All relevant information is captured without silent failure.
- Master schema generation: A unified schema can represent variations across document types.
- Source-grounded outputs: Every value can be traced back to its origin.
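As an illustration of what a source-grounded output could look like, the sketch below attaches a document name, page number, and bounding box to each extracted value. This is an assumed data shape for explanation only, not ADE's actual output format; `SourceRef`, `GroundedValue`, and `verify_grounding` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SourceRef:
    document: str
    page: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1); units are an assumption


@dataclass(frozen=True)
class GroundedValue:
    field: str
    value: str  # raw text as it appears in the source
    source: SourceRef


def verify_grounding(item: GroundedValue, page_text: str) -> bool:
    """Check that the extracted value literally appears in the cited page text."""
    return item.value in page_text


paid = GroundedValue(
    field="amount_paid",
    value="1,250.00",
    source=SourceRef(document="eob.pdf", page=2, bbox=(72.0, 340.5, 130.0, 352.0)),
)
print(verify_grounding(paid, "Amount paid: 1,250.00"))  # True
print(verify_grounding(paid, "Amount paid: 1,520.00"))  # False
```

Storing the raw text rather than a normalized value keeps the check simple: anything that cannot be found on the cited page is flagged for review.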
At a systems level, complex tasks are decomposed into smaller steps and resolved iteratively. This enables the system to handle scale without degrading accuracy.
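The decomposition idea can be sketched as follows. This is a simplified illustration, not ADE's implementation: a large field list is split into small batches, each batch is resolved separately, and unresolved fields are retried in later rounds. The `extract` callable stands in for a model call, and `toy_extract` is a hypothetical stub that only answers two fields per call to mimic degradation on large requests.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Extractor = Callable[[str, List[str]], Dict[str, Optional[str]]]


def batches(fields: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` field names."""
    for start in range(0, len(fields), size):
        yield fields[start:start + size]


def extract_iteratively(
    document: str,
    fields: List[str],
    extract: Extractor,
    batch_size: int = 20,
    max_rounds: int = 3,
) -> Tuple[Dict[str, Optional[str]], List[str]]:
    """Resolve fields in small batches, retrying unresolved ones each round."""
    results: Dict[str, Optional[str]] = {}
    pending = list(fields)
    for _ in range(max_rounds):
        if not pending:
            break
        for group in batches(pending, batch_size):
            found = extract(document, group)
            results.update(
                {k: v for k, v in found.items() if k in group and v is not None}
            )
        pending = [name for name in pending if name not in results]
    # Record unresolved fields explicitly so that nothing fails silently.
    for name in pending:
        results[name] = None
    return results, pending


def toy_extract(document: str, group: List[str]) -> Dict[str, Optional[str]]:
    """Hypothetical stand-in for a model call; answers at most two fields."""
    pairs = dict(
        line.split(": ", 1) for line in document.splitlines() if ": " in line
    )
    return {name: pairs.get(name) for name in group[:2]}


doc = "insured: Acme Corp\npolicy_no: P-1009\nloss_date: 2024-03-14\npaid: 1250.00"
fields = ["insured", "policy_no", "loss_date", "paid", "adjuster"]
values, unresolved = extract_iteratively(doc, fields, toy_extract, batch_size=2)
print(values["paid"], unresolved)  # 1250.00 ['adjuster']
```

The function returns the unresolved field names alongside the values, so a caller can tell a value that is genuinely absent from the document apart from one that was never attempted.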
Conclusion
Document extraction appears solved when evaluated on simple inputs.
It fails when applied to real-world complexity.
The gap is not in model capability alone, but in system design.
Closing that gap requires rethinking the problem as one of architecture, not just extraction.
That is the problem we are solving, and it is why I am so excited about this release.
Try the Playground: See the workflow on your own documents.
Read the API docs: Full documentation for extraction.
