Welcome back, Developers!
In Part 1, we mastered the “Baby Path”: we took a single, clean invoice and turned it into JSON. But in the Kingdom of Unstructured Data, things are rarely that tidy. Often, you will be handed a “Packet”—a 50-page PDF containing Invoices, Pay Stubs, and Bank Statements all scanned together into one messy file. If you try to extract an “Invoice Number” from a Bank Statement, your extraction will fail.
We need a way to untangle this mess. We need “The Splitter”.
The Splitter
Think of the Splitter as an assistant who takes a messy stack of mixed papers and automatically sorts them into neat, labeled folders.
You upload ONE file containing different document types. ADE automatically separates them by type and even identifies distinct documents of the same type (like two different pay stubs).
Imagine you scanned 3 pages into a single PDF:
- Page 1: Bank statement for January
- Page 2: Pay stub from January 15
- Page 3: Pay stub from January 30

How to Split
Step 1: Define the Sorting Rules
We need to tell the Splitter what to look for. We do this by creating a list of “Classes.”
- Name: The type you want (e.g., “Pay Stub”).
- Description (Optional): An explanation that helps the AI better understand the type.
- Identifier (Optional): A specific keyword unique to that type (like “Payment Date”).
```python
# Define your sorting rules
split_classes = [
    {
        "name": "Bank Statement",
        "description": "A monthly financial summary."
    },
    {
        "name": "Pay Stub",
        "description": "A document detailing an employee's earnings.",
        "identifier": "Payment Date"  # Optional. Helps if you have multiple stubs
    }
]
```
Step 2: Run Split
Once your rules are set, you feed your parsed document into The Splitter.
```python
import json
from pathlib import Path

from landingai_ade import LandingAIADE

client = LandingAIADE()

# Revisit Part 1 for document parsing basics. We need the parsed output for splitting.
split_response = client.split(
    markdown=Path("/path/to/parsed_mixed_document.md"),
    split_class=json.dumps(split_classes),
    save_to="output_folder"  # optional: saves as {input_file}_split_output.json
)
```
What ADE Returns:

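The exact response shape isn't reproduced here, but based on how it is used in the workflow below, it contains a list of splits, each carrying a classification (the matched class name) and the markdown for each of its pages. Here is a hedged stand-in for our 3-page example; the field names `splits`, `classification`, and `markdowns` are assumptions inferred from the extraction loop later in this post:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the split response shape; these are NOT the
# SDK's real classes, just an illustration of the fields used below.
@dataclass
class Split:
    classification: str  # the class name matched from split_classes
    markdowns: list      # one markdown string per page in this split

@dataclass
class SplitResponse:
    splits: list

# For our 3-page example, the response would look roughly like:
example_response = SplitResponse(splits=[
    Split("Bank Statement", ["...page 1 markdown..."]),
    Split("Pay Stub", ["...page 2 markdown..."]),
    Split("Pay Stub", ["...page 3 markdown..."]),
])

for split in example_response.splits:
    print(split.classification, len(split.markdowns))
```

Note that the two pay stubs come back as two separate splits, even though they share a class — that is what the optional `identifier` helps with.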
The Combo Move (Sort → Extract)
Now that we have clean, separated “folders”, we can loop through the sorted pile and run the Extractor only on the documents we actually need.
The Full Workflow:
```python
from pydantic import BaseModel, Field

# 1. Define the Reference List (Schema) for the document we care about
class PaystubSchema(BaseModel):
    employee_name: str = Field(..., description="The name of the employee receiving the pay")
    net_pay: float = Field(..., description="The final amount paid after taxes")
    pay_period_end: str = Field(..., description="The date the pay period ended")

# 2. Loop through the sorted documents
for split in split_response.splits:
    # We only care about Pay Stubs right now
    if split.classification == "Pay Stub":
        # Merge the pages within one split to provide the full context for extraction
        merged_markdown = "\n".join(split.markdowns)

        # Run the extraction
        extract_response = client.extract(
            schema=pydantic_to_json_schema(PaystubSchema),
            markdown=merged_markdown,
            save_to="output_folder"  # optional: saves as {input_file}_extract_output.json
        )
        print(extract_response.extraction)
```
We have now extracted all the relevant information for every pay stub in this messy mixed document.
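The `pydantic_to_json_schema` helper used above isn't defined in this post. If your environment doesn't provide one, a minimal sketch is straightforward, assuming Pydantic v2 and that `client.extract` accepts the schema as a JSON string:

```python
import json
from pydantic import BaseModel, Field

def pydantic_to_json_schema(model: type[BaseModel]) -> str:
    # In Pydantic v2, model_json_schema() returns a JSON Schema dict;
    # serialize it to a string for the extract call.
    return json.dumps(model.model_json_schema())

# Quick check with a tiny model:
class Demo(BaseModel):
    total: float = Field(..., description="An example field")

schema_str = pydantic_to_json_schema(Demo)
# schema_str now contains "properties", "total", the description, etc.
```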

🏆 Victory!
Today, you’ve learned the first three great arts:
- The How (Parsing): Translating the messy files into the machine language of Markdown.
- The Which (Splitting): Sorting a jumbled pile into neat, logical stacks of documents.
- The What (Extraction): Pulling the gold from the right pages using a Schema.
👉 Coming Up in Part 3: The Data Map
You have mastered the How (Parsing), the What (Extraction), and the Which (Splitting). But what about Where?
In the next chapter, “Trust the Map,” we will reveal the secrets of The Navigator (Visual Grounding)—how to find the exact pixel coordinates of your data on the page. If the Extractor finds a “Total Amount,” the Navigator shows you precisely where on the page that value is written. Stay tuned!