Preview: Agent Builder is currently in preview and may change before general availability.

When to use

Use fan-out extraction when your app needs to process documents that vary significantly in structure and content. A single extraction prompt cannot handle the differences between an invoice, a bill of lading, and a fuel receipt — each has different fields, layouts, and validation rules. This pattern is the right choice when:
  • You receive mixed document types in a single pipeline (email attachments, bulk uploads)
  • Each document type has a distinct schema with specialized fields
  • You need high extraction accuracy per type rather than a generic best-effort pass
  • The number of document types may grow over time without rearchitecting the workflow

Architecture

The pattern follows two phases: classification, then type-specific extraction.
Document
   │
   ▼
TEXT_UNDERSTANDING (classify type)
   │
   ▼
Condition (fork by type)
  ├──► TEXT_EXTRACTION (Type A) ──►─┐
  ├──► TEXT_EXTRACTION (Type B) ──►─┤
  ├──► TEXT_EXTRACTION (Type C) ──►─┤
  └──► TEXT_EXTRACTION (Type N) ──►─┘
                                    │
                                    ▼
                              Merge results

Phase 1: Classification

A TEXT_UNDERSTANDING node receives the document and classifies it into one of the known types. The classification prompt constrains the output to an enumerated list, so the Condition node can route deterministically.

Phase 2: Type-specific extraction

Each branch contains a TEXT_EXTRACTION node (using the Extract Data from File capability) configured with:
  • A prompt tailored to that document type, instructing the model which fields to look for
  • A response schema defining the exact JSON structure expected for that type
  • Extraction strategy settings optimized for the document format (text-heavy PDFs vs. scanned images)
A merge point downstream collects results from whichever branch executed.

Implementation

Step 1: Configure the classification node

Add a TEXT_UNDERSTANDING node and configure it to classify the document type. Example classification prompt:
You are a document classifier. Analyze the provided document and determine its type.

Respond with exactly one of the following values:
- BOL
- INVOICE
- RATE_CONFIRMATION
- LUMPER_RECEIPT
- FUEL_RECEIPT

Base your classification on the document layout, headers, and field labels.
If the document does not match any known type, respond with UNKNOWN.
Example response schema:
{
  "type": "object",
  "properties": {
    "document_type": {
      "type": "string",
      "enum": ["BOL", "INVOICE", "RATE_CONFIRMATION", "LUMPER_RECEIPT", "FUEL_RECEIPT", "UNKNOWN"]
    },
    "confidence": {
      "type": "number",
      "description": "Classification confidence between 0 and 1"
    }
  },
  "required": ["document_type", "confidence"]
}
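Downstream steps should treat the classifier's output defensively. A minimal sketch of parsing and validating a classification response against the enumerated list (the function name and fallback behavior are illustrative, not part of the platform):

```python
import json

# Allowed classifier outputs, mirroring the enum in the response schema above.
DOCUMENT_TYPES = {"BOL", "INVOICE", "RATE_CONFIRMATION",
                  "LUMPER_RECEIPT", "FUEL_RECEIPT", "UNKNOWN"}

def parse_classification(raw: str) -> tuple[str, float]:
    """Parse the classifier's JSON response and validate it.

    Falls back to ("UNKNOWN", 0.0) if the response is malformed,
    names a type outside the enumerated list, or carries an
    out-of-range confidence.
    """
    try:
        payload = json.loads(raw)
        doc_type = payload["document_type"]
        confidence = float(payload["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return "UNKNOWN", 0.0
    if doc_type not in DOCUMENT_TYPES or not 0.0 <= confidence <= 1.0:
        return "UNKNOWN", 0.0
    return doc_type, confidence
```

Constraining the schema with an enum makes malformed output rare, but the fallback keeps a single bad response from breaking routing.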

Step 2: Add the Condition node

Add a Condition node after the classification node. Configure branches based on the document_type value:
| Branch | Condition | Target |
| --- | --- | --- |
| BOL | document_type == "BOL" | TEXT_EXTRACTION (BOL) |
| Invoice | document_type == "INVOICE" | TEXT_EXTRACTION (Invoice) |
| Rate confirmation | document_type == "RATE_CONFIRMATION" | TEXT_EXTRACTION (Rate confirmation) |
| Lumper receipt | document_type == "LUMPER_RECEIPT" | TEXT_EXTRACTION (Lumper receipt) |
| Default | document_type == "UNKNOWN" | Manual review queue |
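The Condition node's branching reduces to a lookup from document type to target branch. A minimal sketch (branch names are hypothetical; in Agent Builder this mapping lives in the node configuration, not in code):

```python
# Each known type maps to its extractor branch; UNKNOWN, or anything
# unexpected, falls through to the manual review queue, mirroring the
# default branch in the table above.
ROUTES = {
    "BOL": "extract_bol",
    "INVOICE": "extract_invoice",
    "RATE_CONFIRMATION": "extract_rate_confirmation",
    "LUMPER_RECEIPT": "extract_lumper_receipt",
    "FUEL_RECEIPT": "extract_fuel_receipt",
}

def route(document_type: str) -> str:
    """Return the target branch for a classified document type."""
    return ROUTES.get(document_type, "manual_review_queue")
```

Routing unexpected values to the default branch (rather than raising an error) keeps the pipeline running when the classifier drifts.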

Step 3: Configure type-specific extraction nodes

Each branch gets its own TEXT_EXTRACTION node with a prompt and schema optimized for that document type. The Extract Data from File capability handles PDFs, images (via OCR), and other supported file types automatically.
Example prompt (Bill of Lading branch):
Extract all relevant fields from this Bill of Lading document.
Pay special attention to carrier information, shipment details, and freight charges.
If a field is not present, return null for that field.
Response schema:
{
  "type": "object",
  "properties": {
    "bol_number": { "type": "string" },
    "carrier": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "mc_number": { "type": "string" },
        "scac_code": { "type": "string" }
      }
    },
    "shipper": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "address": { "type": "string" },
        "city": { "type": "string" },
        "state": { "type": "string" },
        "zip": { "type": "string" }
      }
    },
    "consignee": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "address": { "type": "string" },
        "city": { "type": "string" },
        "state": { "type": "string" },
        "zip": { "type": "string" }
      }
    },
    "ship_date": { "type": "string" },
    "delivery_date": { "type": "string" },
    "pieces": { "type": "integer" },
    "weight": { "type": "number" },
    "freight_charges": { "type": "number" }
  }
}
Keep schemas as flat as possible for simple document types (like fuel receipts and lumper receipts). Reserve nested objects for complex types that genuinely have grouped fields (like BOL with shipper/consignee blocks).
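For contrast with the nested BOL schema above, a flat schema for a simple type might look like the following sketch (the field names are hypothetical, chosen to illustrate a typical fuel receipt):

```python
# Hypothetical flat schema for a simple document type: every field sits
# at the top level, with no nested objects for the extractor to fill.
FUEL_RECEIPT_SCHEMA = {
    "type": "object",
    "properties": {
        "merchant_name": {"type": "string"},
        "location": {"type": "string"},
        "date": {"type": "string"},
        "gallons": {"type": "number"},
        "price_per_gallon": {"type": "number"},
        "total": {"type": "number"},
    },
}
```

A flat schema gives the model a simpler extraction target and makes the output easier to validate and store.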

Configuration reference

| Component | Node type | Key settings |
| --- | --- | --- |
| Classifier | TEXT_UNDERSTANDING | Prompt with enumerated types, constrained response schema |
| Router | Condition | Branch per document type, default branch for unknown types |
| Extractor (per type) | TEXT_EXTRACTION (Extract Data from File) | Type-specific prompt, tailored response schema, extraction strategy |
| Merge | Merge / End node | Collects output from whichever branch executed |
The TEXT_EXTRACTION node uses the Extract Data from File capability, which supports PDFs, DOCX, XLSX, PPTX, and image formats (JPG, PNG, TIFF). Images are automatically converted to PDF before processing via OCR.

Scaling to many document types

This pattern scales well because adding a new document type requires only:
  1. Adding the new type to the classification prompt’s enumerated list
  2. Adding a new branch in the Condition node
  3. Adding a new TEXT_EXTRACTION node with the type-specific prompt and schema
No existing branches are modified. This makes the pattern suitable for domains with dozens of document types.
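The three steps above amount to keeping per-type configuration in a registry. A sketch of that shape (the `ExtractorConfig` class and `register` helper are illustrative, not platform APIs; schemas are elided):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractorConfig:
    """Per-type extraction settings: prompt plus response schema."""
    prompt: str
    schema: dict

# Registry keyed by classifier output. Adding a document type means one
# new entry here, one new enum value in the classifier schema, and one
# new Condition branch; existing entries are untouched.
EXTRACTORS: dict[str, ExtractorConfig] = {
    "BOL": ExtractorConfig(
        prompt="Extract all relevant fields from this Bill of Lading document.",
        schema={"type": "object"},  # full schema as shown earlier
    ),
    "FUEL_RECEIPT": ExtractorConfig(
        prompt="Extract the merchant, date, gallons, and total from this fuel receipt.",
        schema={"type": "object"},
    ),
}

def register(doc_type: str, config: ExtractorConfig) -> None:
    """Add a new document type without modifying existing branches."""
    EXTRACTORS[doc_type] = config
```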
| Domain | Example document types |
| --- | --- |
| Logistics | BOL, invoice, rate confirmation, lumper receipt, fuel receipt, proof of delivery, customs declaration |
| Mortgage | Product sheets, income statements, bank statements, tax returns, appraisal reports, title documents, regulatory disclosures |
| Insurance | Claims forms, medical records, police reports, repair estimates, coverage declarations |

Variations

Parallel extraction

Instead of classifying first, run extraction for all document types simultaneously and pick the result with the highest confidence. This trades higher compute cost for lower latency and removes the classification step as a source of misrouting.
Parallel extraction works best when you have a small number of document types (under 5). Beyond that, the cost of running every extractor on every document becomes impractical.
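The pick-the-winner logic can be sketched as follows. This assumes each extractor returns a dict carrying a `confidence` field alongside its extracted data; that contract is an assumption of the sketch, not a platform guarantee:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_extract(document: bytes, extractors: dict) -> tuple[str, dict]:
    """Run every extractor on the document concurrently and keep the
    result with the highest self-reported confidence."""
    with ThreadPoolExecutor() as pool:
        futures = {doc_type: pool.submit(fn, document)
                   for doc_type, fn in extractors.items()}
        results = {doc_type: f.result() for doc_type, f in futures.items()}
    best = max(results, key=lambda t: results[t].get("confidence", 0.0))
    return best, results[best]
```

Because every extractor runs on every document, cost grows linearly with the number of types, which is why this variation suits small type sets.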

Hierarchical classification

For large document sets, classify in two stages: first into a broad category (financial, shipping, legal), then into a specific type within that category. This reduces the number of options the classifier evaluates at each stage.
Document
   │
   ▼
TEXT_UNDERSTANDING (broad category)
   │
   ├──► TEXT_UNDERSTANDING (financial subtypes) ──► Condition ──► Extractors
   ├──► TEXT_UNDERSTANDING (shipping subtypes)  ──► Condition ──► Extractors
   └──► TEXT_UNDERSTANDING (legal subtypes)     ──► Condition ──► Extractors
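The two stages compose as a simple function of two classifiers. In this sketch the classifier callables stand in for the TEXT_UNDERSTANDING nodes in the diagram; the function and parameter names are illustrative:

```python
def classify(document: bytes, broad_classifier, subtype_classifiers: dict) -> str:
    """Two-stage classification: broad category first, then the subtype
    classifier for that category. Unrecognized categories fall back to
    UNKNOWN rather than guessing a subtype."""
    category = broad_classifier(document)
    subtype_fn = subtype_classifiers.get(category)
    return subtype_fn(document) if subtype_fn else "UNKNOWN"
```

Each second-stage classifier only needs to distinguish the handful of types within its category, which keeps every prompt's enumerated list short.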

Confidence fallback

Add a confidence threshold to the classification step. If the classifier returns a confidence below the threshold, route the document to a manual classification queue instead of risking an incorrect extraction.
Condition logic:
  confidence >= 0.8  → route to type-specific extractor
  confidence < 0.8   → route to manual classification
Start with a confidence threshold of 0.8 and adjust based on your observed accuracy. Track classification accuracy over time to identify document types that need prompt refinement.
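The threshold check combines with the UNKNOWN default into one routing decision. A minimal sketch, using the 0.8 starting threshold from the text (branch names are hypothetical):

```python
CONFIDENCE_THRESHOLD = 0.8  # starting point; tune against observed accuracy

def route_with_fallback(document_type: str, confidence: float) -> str:
    """Route to the type-specific extractor only when the classifier is
    confident enough; otherwise send to manual classification."""
    if confidence >= CONFIDENCE_THRESHOLD and document_type != "UNKNOWN":
        return f"extract_{document_type.lower()}"
    return "manual_classification"
```

A confident UNKNOWN still goes to manual classification: the classifier being sure the document matches no known type is itself a reason for human review.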

Last modified on March 16, 2026