
The ADE Storybook: Part 2

Ava Xia



Welcome back, Developers!

In Part 1, we mastered the “Baby Path”: we took a single, clean invoice and turned it into JSON. But in the Kingdom of Unstructured Data, things are rarely that tidy. Often, you will be handed a “Packet”—a 50-page PDF containing Invoices, Pay Stubs, and Bank Statements all scanned together into one messy file. If you try to extract an “Invoice Number” from a Bank Statement, your extraction will fail.

We need a way to untangle this mess. We need “The Splitter.”

The Splitter

Think of the Splitter as an assistant who takes a messy stack of mixed papers and automatically sorts them into neat, labeled folders.

You upload ONE file containing different document types. ADE automatically separates them by type and even identifies distinct documents of the same type (like two different pay stubs).

Imagine you scanned 3 pages into a single PDF:

  • Page 1: Bank statement for January
  • Page 2: Pay stub from January 15
  • Page 3: Pay stub from January 30


How to Split

Step 1: Define the Sorting Rules

We need to tell the Splitter what to look for. We do this by creating a list of “Classes.”

  • Name: The type you want (e.g., “Pay Stub”).
  • Description (Optional): An explanation that helps the AI better understand the type.
  • Identifier (Optional): A specific keyword unique to that type (like “Payment Date”).
# Define your sorting rules
split_classes = [
    {
        "name": "Bank Statement",
        "description": "A monthly financial summary."
    },
    {
        "name": "Pay Stub",
        "description": "A document detailing an employee's earnings.",
        "identifier": "Payment Date" # Optional. Helps if you have multiple stubs
    }
]

Step 2: Run Split

Once your rules are set, you feed your parsed document into The Splitter.

import json
from pathlib import Path
from landingai_ade import LandingAIADE

client = LandingAIADE()

# Revisit Part 1 for document parsing basics. We need the parsed output for splitting.
split_response = client.split(
    markdown=Path("/path/to/parsed_mixed_document.md"),
    split_class=json.dumps(split_classes),
    save_to="output_folder"  # optional: saves as {input_file}_split_output.json
)

What ADE Returns:

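The figure that originally appeared here showed the split output. Based on the fields the combo-move code below reads (`splits`, `classification`, `identifier`, `markdowns`), the response for our 3-page PDF is shaped roughly like this sketch — the exact structure and identifier values are assumptions, so check your SDK version:

```python
# Illustrative sketch of the split output for our 3-page PDF.
# Field names mirror what the workflow below consumes; the exact
# structure and identifier values are assumptions, not an official spec.
example_split_response = {
    "splits": [
        {
            "classification": "Bank Statement",
            "identifier": None,            # no identifier was configured
            "markdowns": ["<page 1 markdown>"],
        },
        {
            "classification": "Pay Stub",
            "identifier": "January 15",    # distinguishes the two stubs
            "markdowns": ["<page 2 markdown>"],
        },
        {
            "classification": "Pay Stub",
            "identifier": "January 30",
            "markdowns": ["<page 3 markdown>"],
        },
    ]
}

# Each pay stub arrives as its own split, ready to be grouped and extracted.
pay_stubs = [s for s in example_split_response["splits"]
             if s["classification"] == "Pay Stub"]
print(len(pay_stubs))  # → 2
```

Notice that the two pay stubs come back as separate splits with distinct identifiers, which is exactly what the grouping step below relies on.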

The Combo Move (Sort → Extract)

Now that we have a clean and separated “folder”, we can loop through the sorted pile and only use the Extractor on the documents we actually need.

The Full Workflow:

from pydantic import BaseModel, Field
from landingai_ade.lib import pydantic_to_json_schema

# Reuses `client` and `split_response` from Step 2.

# 1. Define the Reference List (Schema) for the document we care about
class PaystubSchema(BaseModel):
    employee_name: str = Field(..., description="The name of the employee receiving the pay")
    net_pay: float = Field(..., description="The final amount paid after taxes")
    pay_period_end: str = Field(..., description="The date the pay period ended")

# 2. Group the documents by identifier first
# This handles the edge case where one document is fragmented across multiple splits
grouped_paystubs = {}

for split in split_response.get("splits", []):
    # We only care about Paystubs right now
    if split.get("classification") == "Pay Stub":
        doc_id = split.get("identifier")
        
        # Only group if an identifier exists
        if doc_id:
            if doc_id not in grouped_paystubs:
                grouped_paystubs[doc_id] = []
            
            # Collect the markdowns for this specific identifier
            grouped_paystubs[doc_id].extend(split.get("markdowns", []))

# 3. Loop through the grouped documents
for doc_id, markdowns in grouped_paystubs.items():
    
    # We merge all pages for this identifier to provide the full context for extraction
    merged_markdown = "\n".join(markdowns)
    
    # Run the extraction using a unique filename per identifier to prevent overwriting
    extract_response = client.extract(
        schema=pydantic_to_json_schema(PaystubSchema),
        markdown=merged_markdown,
        save_to=f"output_folder/{doc_id}_extract_output.json" 
    )
    
    print(f"Results for Paystub ID {doc_id}:")
    print(extract_response.extraction)


And that’s it: we have pulled out all the relevant information for every pay stub buried in this messy mixed document.


🏆 Victory!

Today, you’ve learned the first three great arts:

  • The How (Parsing): Translating the messy files into the machine language of Markdown.
  • The Which (Splitting): Sorting a jumbled pile into neat, logical stacks of documents.
  • The What (Extraction): Using a Schema to pull the gold from the right pages.

👉 Coming Up in Part 3: The Data Map

You have mastered the How (Parsing), the What (Extraction), and the Which (Splitting). But what about Where?

In the next chapter, “Trust the Map,” we will reveal the secrets of The Navigator (Visual Grounding)—how to find the exact pixel coordinates of your data on the page. If the Extractor finds a “Total Amount,” the Navigator shows you precisely where on the page that value is written. Stay tuned!