Benchmarks: Answer 99.16% of DocVQA Without Images in QA: Agentic Document ExtractionRead more

Why Extraction Breaks When Schemas Drift

How ADE keeps schemas accurate across document sources and updates

Ava Xia

Ava Xia

Share On :
Why Extraction Breaks When Schemas DriftWhy Extraction Breaks When Schemas Drift
Product

TL;DR Extraction schemas break when new document sources use different field names and formats. A schema built for one carrier's EOBs fails on another's. LandingAI's extraction capabilities addresses that in the schema itself: alternative names capture label variation, formatting guidance reduces downstream cleanup, and the schema building workflow lets developers prompt-edit a schema, and propose updates as the document set changes. The result: one schema handles multiple document sources and stays current as documents change.

A schema is the template that tells an extraction system which fields to pull from a document and what format to expect. Think of it as a contract: "pull the claim ID, the date of service, and the patient responsibility amount, then return them as a string, a string, and a number."

An EOB (Explanation of Benefits) is the statement an insurer sends after a medical visit, showing what was billed, what insurance covered, and what the patient owes. It is one of the most common document types in healthcare data processing.

Schemas are usually built against a single document source. An insurance company processing EOBs might build one that works perfectly on Aetna's forms. But the moment Blue Cross or Cigna EOBs arrive, fields have different names, dates use different formats, and the schema silently returns nulls. Six months later, Aetna updates its form layout and even the original source breaks.

LandingAI's Extraction capability handles both problems from the schema design, so extraction stays accurate across sources and over time.

Different Carriers Label the Same Fields Differently

Take a single data point: the unique identifier for a claim. Aetna calls it "Claim Number." Blue Cross calls it "Reference ID." Cigna calls it "EOB Number." All three mean the same thing, but a schema with a field called claim_id won't match any of them automatically.

That is why semantic context inside the schema matters. A field like claim_id should be able to carry multiple label variants rather than forcing the pipeline to fork per carrier:

{
  "type": "object",
  "properties": {
    "claim_id": {
      "description": "The unique identifier for the claim on the EOB.",
      "x-alternativeNames": [
        "Reference ID",
        "Claim Number",
        "EOB Number"
      ],
      "type": "string"
    }
  }
}

For alternative names, you don't need an exhaustive dictionary of every possible label; the API simply needs a representative sample of variations. By providing a few examples of how different carriers label a field, the API learns to recognize the underlying intent, meaning the system can often identify a new carrier's 'Reference Number' even if it wasn't explicitly in your initial list.

The same idea applies to formatting. One carrier may represent the date of service as 03/15/2025, another as March 15, 2025. Formatting guidance in the schema gives the extraction workflow a target output shape before the value hits downstream code.

The API uses semantic reasoning to synthesize and reformat different-looking data into precise structures:

FieldNatural Language ConstraintSource InputNormalized Output
billing_amount"Float, no symbols or commas""$1,250.50 USD"1250.50
procedure_code"Extract 5-digit CPT code only""Office visit (99213)"99213
provider_name"Format as 'Last, First' and capitalize""dr. jonathan smith"Smith, Jonathan
treatment_time"Convert all durations to total minutes""1hr 20min"80

Documents Change, but Schemas Don't Update Themselves

Cross-carrier variation is only half the problem. Documents also change over time.

When a carrier redesigns a form, fields may appear, disappear, or become more granular (e.g., Patient Responsibility splitting into separate values). When this happens, your existing schema keeps running, but it is no longer aligned with the incoming documents.

You have two primary ways to adapt to these changes:

  • Updates: You can feed the new documents into the build schema API and instruct it to generate the updated schema automatically.
  • Prompt-Based Editing: If you already know the required change — such as dividing patient_responsibility into more specific subfields — you can describe the adjustment directly in the prompt.

Why this matters for enterprise developers

For enterprise developers, the gain is not abstract:

  • Less carrier-specific field mapping
  • Less silent breakage when labels or layouts change
  • A cleaner path to update schemas with prompts instead of brittle manual rewrites
  • Versioned schema changes when the EOB set evolves
  • Less downstream normalization code for dates and amounts
  • Less pressure to fork one schema into many slightly different variants

A stronger schema does not just improve maintainability. It also improves extraction consistency because the extract step has better semantic context to work from.

  • Try the Playground: Generate a schema on a mixed EOB set and inspect the alternative names and formatting guidance it produces.

  • Read the API docs: Full documentation for extraction.