
The ADE Storybook Part 2: The Splitter

Ava Xia

Welcome back, Developers!

In Part 1, we mastered the “Baby Path”: we took a single, clean invoice and turned it into JSON. But in the Kingdom of Unstructured Data, things are rarely that tidy. Often, you will be handed a “Packet”: a 50-page PDF containing Invoices, Pay Stubs, and Bank Statements all scanned together into one messy file. If you try to extract an “Invoice Number” from a Bank Statement, your extraction will fail.

We need a way to untangle this mess. We need “The Splitter”.

The Splitter

Think of the Splitter as an assistant who takes a messy stack of mixed papers and automatically sorts them into neat, labeled folders.

You upload ONE file containing different document types. ADE automatically separates them by type and even identifies distinct documents of the same type (like two different pay stubs).

Imagine you scanned 3 pages into a single PDF:

  • Page 1: Bank statement for January
  • Page 2: Pay stub from January 15
  • Page 3: Pay stub from January 30

How to Split

Step 1: Define the Sorting Rules

We need to tell the Splitter what to look for. We do this by creating a list of “Classes.”

  • Name: The type you want (e.g., “Pay Stub”).
  • Description (Optional): An explanation that helps the AI better understand the type.
  • Identifier (Optional): A specific keyword unique to that type (like “Payment Date”).
# Define your sorting rules
split_classes = [
    {
        "name": "Bank Statement",
        "description": "A monthly financial summary."
    },
    {
        "name": "Pay Stub",
        "description": "A document detailing an employee's earnings.",
        "identifier": "Payment Date" # Optional. Helps if you have multiple stubs
    }
]

Step 2: Run Split

Once your rules are set, you feed your parsed document into The Splitter.

import json
from pathlib import Path
from landingai_ade import LandingAIADE

client = LandingAIADE()

# Revisit Part 1 for document parsing basics. We need the parsed output for splitting.
split_response = client.split(
    markdown=Path("/path/to/parsed_mixed_document.md"),
    split_class=json.dumps(split_classes),
    save_to="output_folder"  # optional: saves as {input_file}_split_output.json
)

What ADE Returns
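The screenshot of the raw response isn’t reproduced here, but judging from the fields the workflow below relies on, each split carries a `classification` and the parsed `markdowns` for its pages. Here is a minimal mock of that shape for the 3-page example above (all field values are hypothetical, not real API output):

```python
from types import SimpleNamespace

# Hypothetical mock (not the real API response) of the split result for the
# 3-page example: one bank statement plus two distinct pay stubs.
split_response = SimpleNamespace(splits=[
    SimpleNamespace(classification="Bank Statement", markdowns=["# January Bank Statement ..."]),
    SimpleNamespace(classification="Pay Stub", markdowns=["# Pay Stub, Jan 15 ..."]),
    SimpleNamespace(classification="Pay Stub", markdowns=["# Pay Stub, Jan 30 ..."]),
])

# Tally documents per class, the way you would with a real response
counts = {}
for split in split_response.splits:
    counts[split.classification] = counts.get(split.classification, 0) + 1
print(counts)  # {'Bank Statement': 1, 'Pay Stub': 2}
```

Note that the two pay stubs come back as two separate splits even though they share a class, which is what lets you treat each one as its own document downstream.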

The Combo Move (Sort → Extract)

Now that we have clean, separated “folders,” we can loop through the sorted pile and run the Extractor only on the documents we actually need.

The Full Workflow:

from pydantic import BaseModel, Field
from landingai_ade.lib import pydantic_to_json_schema  # schema helper from the ADE SDK

# 1. Define the Reference List (Schema) for the document we care about
class PaystubSchema(BaseModel):
    employee_name: str = Field(..., description="The name of the employee receiving the pay")
    net_pay: float = Field(..., description="The final amount paid after taxes")
    pay_period_end: str = Field(..., description="The date the pay period ended")

# 2. Loop through the sorted documents
for split in split_response.splits:

    # We only care about Pay Stubs right now
    if split.classification == "Pay Stub":

        # Merge the pages within one split to give the Extractor full context
        merged_markdown = "\n".join(split.markdowns)

        # Run the extraction
        extract_response = client.extract(
            schema=pydantic_to_json_schema(PaystubSchema),
            markdown=merged_markdown,
            save_to="output_folder"  # optional: saves as {input_file}_extract_output.json
        )

        print(extract_response.extraction)

We have now pulled all the relevant information from every pay stub in this messy mixed document.
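If you are curious what the `schema` argument actually carries, Pydantic itself can render the model as JSON Schema. The snippet below uses standard Pydantic v2 (`model_json_schema()`) purely for inspection; the ADE helper may add its own wrapping on top of this:

```python
import json
from pydantic import BaseModel, Field

class PaystubSchema(BaseModel):
    employee_name: str = Field(..., description="The name of the employee receiving the pay")
    net_pay: float = Field(..., description="The final amount paid after taxes")
    pay_period_end: str = Field(..., description="The date the pay period ended")

# Render the model as plain JSON Schema for inspection
schema = PaystubSchema.model_json_schema()
print(json.dumps(schema, indent=2))

# Each field appears under "properties" with its type and description,
# and all three are listed under "required".
```

Those per-field descriptions are not decoration: they are the hints the extraction model reads, so it pays to write them as carefully as the class descriptions in your split rules.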


🏆 Victory!

Today, you’ve already learned the first three great arts:

  • The How (Parsing): Translating messy files into the machine language of Markdown.
  • The Which (Splitting): Sorting a jumbled pile into neat, logical stacks of documents.
  • The What (Extraction): Pulling the gold from the right pages using a Schema.

👉 Coming Up in Part 3: The Data Map

You have mastered the How (Parsing), the What (Extraction), and the Which (Splitting). But what about Where?

In the next chapter, “Trust the Map,” we will reveal the secrets of The Navigator (Visual Grounding)—how to find the exact pixel coordinates of your data on the page. If the Extractor finds a “Total Amount,” the Navigator shows you precisely where on the page that value is written. Stay tuned!