Extract Data from File

Overview

The Extract Data from File node extracts text and structured data from documents and images within Agent Builder workflows. It supports multiple extraction strategies so you can balance accuracy, speed, and cost based on your document types.

Supported file formats

Category	Formats
Documents	PDF, DOCX, XLSX, PPTX, HTML, CSV, Markdown, AsciiDoc
Images	JPG, JPEG, PNG, TIFF, BMP, WEBP

Before parsing, .xls files are automatically converted to .xlsx, and .txt files are converted to Markdown. Images are processed directly through OCR by the Document Parser service.
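A pre-flight check on the file extension can catch unsupported files before they reach the node. The sketch below is illustrative only. The extension sets are inferred from the table above, the spellings `.md`, `.adoc`, `.html`, and `.htm` are assumptions (the table names formats, not extensions), and `classify_file` is a hypothetical helper, not a platform function.

```python
from pathlib import PurePosixPath

# Extensions inferred from the supported-formats table. The .md, .adoc, .html
# and .htm spellings are assumptions; the table lists formats, not extensions.
DOCUMENT_EXTS = {".pdf", ".docx", ".xlsx", ".pptx", ".html", ".htm", ".csv", ".md", ".adoc"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".tiff", ".bmp", ".webp"}
# Formats that the text says are converted before parsing.
CONVERTED = {".xls": ".xlsx", ".txt": ".md"}


def classify_file(path: str) -> str:
    """Return 'document', 'image', 'converted', or 'unsupported' for a file path."""
    ext = PurePosixPath(path).suffix.lower()
    if ext in CONVERTED:
        return "converted"
    if ext in DOCUMENT_EXTS:
        return "document"
    if ext in IMAGE_EXTS:
        return "image"
    return "unsupported"
```

A workflow could call such a check before the node and route unsupported files to a manual step instead of letting the extraction fail.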

Configuration

To add the node to an Agent Builder workflow:

1. Open your workflow in Agent Builder.
2. Add an Extract Data from File node from the Document Operations category.
3. Configure the extraction settings described below.

Extract Data from File node configuration

Document source

Document Source

select

required

The source system for the document. Select Document Plugin to use files stored in the FlowX Documents Plugin.

Default: Document Plugin

Use test file

Use Test File

boolean

Toggle ON to use a test file during workflow configuration and testing, without requiring a live file path from process data.

Default: OFF

File path

File Path

string

required

The path to the input file to process. This can reference a file stored in the Documents Plugin.

When Use Test File is turned off, map this field to a process variable or workflow data key that contains the file path at runtime.

Response key

responseKey

string

required

The key where the extraction results are stored in the workflow data.

Example: extractedData
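To make the relationship between the file path mapping and the response key concrete, here is a minimal sketch that models workflow data as a plain dictionary. The key `invoiceFilePath`, the helper `store_result`, and the shape of the result are all hypothetical. The actual extraction payload is not described in this section, so a placeholder stands in for it.

```python
from typing import Any, Dict


def store_result(workflow_data: Dict[str, Any], response_key: str, result: Any) -> Dict[str, Any]:
    """Return a copy of the workflow data with the result stored under response_key."""
    if not response_key:
        raise ValueError("response key is required")
    updated = dict(workflow_data)  # shallow copy; the input dict is left untouched
    updated[response_key] = result
    return updated


# Hypothetical runtime data: the File Path field is mapped to "invoiceFilePath".
workflow_data = {"invoiceFilePath": "invoices/2025/inv-001.pdf"}
# Placeholder result; the real payload shape is not specified in this section.
result = {"text": "placeholder extracted text"}

updated = store_result(workflow_data, "extractedData", result)
```

Downstream nodes would then read the extraction output from the `extractedData` key, alongside whatever data the workflow already carried.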

Extraction method

Extraction Method

select

required

Select the method used to extract content from the file. Each method has different accuracy, speed, and cost characteristics.

Method	Best for	Speed	Cost	Accuracy
Automatic	Mixed document sets where you don’t know the format up front	Varies	Varies	Varies
LLM Model	Complex layouts, handwritten text, mixed content	Slow	High	High
OCR Engine	Scanned documents, image-heavy files	Medium	Medium	Medium
Text Parsing	Clean digital PDFs with selectable text	Fast	Free	Low–Medium

Automatic

The platform selects the best extraction method for each document automatically (“AI will choose the best method for each document”). Use it for mixed document sets where the format varies and you don’t want to pick a strategy per file.

Image extraction options

When using LLM Model or OCR Engine, you can configure how images found within the document are handled.

Image Extraction

select

Select how images embedded in the document should be processed.

Option	Description	When to use
Image Description	Generates a text description of images	When you need to understand what images depict (charts, photos, diagrams)
Image Contents	Extracts text and data from images	When images contain text, tables, or data you need to capture

LLM Model supports both Image Description and Image Contents. OCR Engine supports only Image Contents.

Image extraction options are not available when using the Text Parsing strategy, since Text Parsing only handles selectable text content.

Signature detection

Detect Signatures

boolean

Turn on detection of signatures within the document.

Default: OFF

When enabled, the node identifies areas of the document that contain signatures and includes their locations in the extraction results.

Signature detection is only available when using LLM Model or OCR Engine strategies. It is not available for Text Parsing.
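The compatibility rules for image extraction and signature detection can be summarized as a small validation routine. This is a sketch under stated assumptions: `check_settings` and its parameter names are hypothetical, and Automatic is deliberately left unchecked because the text does not say which options it supports.

```python
from typing import List, Optional

# Rules as stated in the text. Automatic is omitted because the text does not
# say which image or signature options it supports.
IMAGE_OPTIONS = {
    "LLM Model": {"Image Description", "Image Contents"},
    "OCR Engine": {"Image Contents"},
    "Text Parsing": set(),
}
SIGNATURE_METHODS = {"LLM Model", "OCR Engine"}


def check_settings(method: str, image_extraction: Optional[str] = None,
                   detect_signatures: bool = False) -> List[str]:
    """Return a list of conflicts; an empty list means none were found."""
    allowed = IMAGE_OPTIONS.get(method)
    if allowed is None:
        # No documented rules for this method (for example, Automatic).
        return []
    problems = []
    if image_extraction is not None and image_extraction not in allowed:
        problems.append(f"{image_extraction} is not available with {method}")
    if detect_signatures and method not in SIGNATURE_METHODS:
        problems.append(f"Detect Signatures is not available with {method}")
    return problems
```

For example, `check_settings("OCR Engine", "Image Description")` reports one conflict, because OCR Engine supports only Image Contents.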

Personal Information Guard

For the full reference — sensitivity presets, the complete entity catalog, scan flow, and run-log fields — see Personal Information Guard.

Personal Information Guard

boolean

Detects and replaces personal data in messages before they reach the model. A system instruction is automatically added so the agent handles redacted content naturally.

Default: OFF

When turned on, the following sub-options become available:

Detection Algorithm Sensitivity — One of Strict, Balanced (default), Relaxed, or Custom. Controls how aggressively the detector flags potential matches.
Detection Target — Check Node Input, Node Output, or both to choose which payloads are scanned.
Personal Info Types — Opens the Customize Entities modal, the picker for which of the 24 supported entity types should be detected. All 24 are enabled by default. Entities are grouped into Universal and Regional (per-locale) sets.

Supported entity types (24)

Group	Entity types
Universal (8)	EMAIL, PHONE, CREDIT_CARD, IBAN, MAC_ADDRESS, CRYPTO_WALLET, PERSON, ADDRESS
Regional, EN (6)	SSN, US_PASSPORT, US_BANK_ACCOUNT, US_ITIN, UK_NHS, EU_VAT_ID
Regional, RO (10)	CNP, CUI, RO_IBAN, RO_PHONE, RO_PASSPORT, RO_ID_CARD, LICENSE_PLATE, HEALTH_CARD, POSTAL_CODE, LANDLINE
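When a custom entity selection is kept outside the UI (for instance in a review checklist), it can be checked against the catalog above. The sketch below copies the entity names from the list. The dictionary layout and the `unknown_entities` helper are hypothetical and do not reflect how the platform stores this setting.

```python
# Entity names copied from the catalog above; the grouping keys are labels
# chosen for this sketch only.
ENTITY_CATALOG = {
    "Universal": ["EMAIL", "PHONE", "CREDIT_CARD", "IBAN", "MAC_ADDRESS",
                  "CRYPTO_WALLET", "PERSON", "ADDRESS"],
    "Regional EN": ["SSN", "US_PASSPORT", "US_BANK_ACCOUNT", "US_ITIN",
                    "UK_NHS", "EU_VAT_ID"],
    "Regional RO": ["CNP", "CUI", "RO_IBAN", "RO_PHONE", "RO_PASSPORT",
                    "RO_ID_CARD", "LICENSE_PLATE", "HEALTH_CARD",
                    "POSTAL_CODE", "LANDLINE"],
}
ALL_ENTITIES = {name for group in ENTITY_CATALOG.values() for name in group}


def unknown_entities(selection):
    """Return the selected names that are not in the catalog, sorted."""
    return sorted(set(selection) - ALL_ENTITIES)
```

Calling `unknown_entities(["EMAIL", "CNP", "DOB"])` flags `DOB`, which is not one of the 24 supported types.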

When a node runs with Personal Information Guard on, the scan lists each detected entity individually — its type, confidence score, original value, the replacement applied, and whether it was matched on the node input or output. Document and image scans report this per-entity list too, and each detection also includes the region (x, y, width, height) where the entity was found, so the run console can highlight its location.
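The per-entity report described above can be modeled as a small record type, which is useful when post-processing run logs. This is only a sketch: the field names, the 0 to 1 confidence range, and the replacement token format are assumptions, and the sample values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Detection:
    entity_type: str    # for example "EMAIL"
    confidence: float   # assumed here to lie between 0 and 1
    original: str       # value found in the payload
    replacement: str    # value substituted for it
    target: str         # "input" or "output"
    # (x, y, width, height); only present for document and image scans
    region: Optional[Tuple[float, float, float, float]] = None


def count_by_type(detections: Iterable[Detection], min_confidence: float = 0.0) -> Dict[str, int]:
    """Count detections per entity type, ignoring those below min_confidence."""
    counts: Dict[str, int] = {}
    for d in detections:
        if d.confidence >= min_confidence:
            counts[d.entity_type] = counts.get(d.entity_type, 0) + 1
    return counts


# Invented sample data; the replacement tokens are placeholders.
detections = [
    Detection("EMAIL", 0.98, "jane@example.com", "<EMAIL_1>", "input"),
    Detection("PERSON", 0.55, "Jane Doe", "<PERSON_1>", "input"),
    Detection("EMAIL", 0.91, "bob@example.com", "<EMAIL_2>", "output",
              region=(10.0, 20.0, 120.0, 14.0)),
]
```

With this sample, counting at a minimum confidence of 0.9 keeps the two EMAIL detections and drops the lower-confidence PERSON match.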

Examples

Processing a scanned invoice

Scenario: Extract line items and totals from a scanned paper invoice.

Configuration:

Extraction Method: OCR Engine
Image Extraction: Image Contents
Detect Signatures: ON (to capture the approval signature)

The OCR engine processes the scanned image, extracts text from both the document body and any embedded images (such as company logos with text), and identifies signature areas.

Analyzing a contract with charts

Scenario: Extract text and understand visual elements from a contract that includes charts and diagrams.

Configuration:

Extraction Method: LLM Model
Image Extraction: Image Description
Detect Signatures: OFF

The LLM analyzes each page, extracts the contract text, and generates descriptions of charts and diagrams (for example, “Bar chart showing quarterly revenue growth from Q1 to Q4 2025”).

Extracting text from a clean PDF

Scenario: Extract text from a digitally generated report PDF.

Configuration:

Extraction Method: Text Parsing
Image Extraction: N/A (not available for Text Parsing)
Detect Signatures: N/A (not available for Text Parsing)

Text Parsing directly extracts the selectable text from the PDF with no AI or OCR processing, making it the fastest and lowest-cost option.

Best practices

Start with Text Parsing

For digital PDFs, try Text Parsing first. Only use OCR or LLM if the results are insufficient.

Match strategy to document type

Use OCR for scanned documents, LLM for complex layouts, and Text Parsing for clean digital files.
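This guidance can be turned into a simple decision helper when documents are classified upstream. The function below is a sketch, not platform behavior: `suggest_method`, its three flags, and the order in which they are checked are assumptions made for illustration.

```python
def suggest_method(has_selectable_text: bool, is_scanned: bool, complex_layout: bool) -> str:
    """Suggest an extraction method from coarse document traits.

    The precedence (layout first, then scan status) is an assumption of this
    sketch; the guidance above does not rank the criteria.
    """
    if complex_layout:
        return "LLM Model"
    if is_scanned or not has_selectable_text:
        return "OCR Engine"
    return "Text Parsing"
```

A clean digital PDF (`has_selectable_text=True`, the other flags `False`) maps to Text Parsing, which matches the advice to try it first.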

Consider cost at scale

LLM processing costs increase linearly with page count. For high-volume workloads, use Text Parsing or OCR where possible.
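A rough linear cost model shows why routing matters at volume. The rates below are invented placeholder units, not real prices. Only the zero rate for Text Parsing reflects the method table, which lists its cost as Free. `estimate_cost` is a hypothetical helper.

```python
from typing import Dict


def estimate_cost(pages_by_method: Dict[str, int], rate_per_page: Dict[str, float]) -> float:
    """Sum pages times the per-page rate for each method (a linear model)."""
    return sum(pages * rate_per_page[method] for method, pages in pages_by_method.items())


# Placeholder rates in arbitrary units; substitute your own figures.
rates = {"LLM Model": 4, "OCR Engine": 1, "Text Parsing": 0}

# 1,000 pages sent entirely to the LLM versus the same pages routed by type.
all_llm = estimate_cost({"LLM Model": 1000}, rates)
routed = estimate_cost({"LLM Model": 100, "OCR Engine": 300, "Text Parsing": 600}, rates)
```

Under these placeholder rates, the routed workload costs 700 units against 4,000 for the all-LLM run. The actual ratio depends entirely on your own rates and document mix.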

Turn off unused features

Turn off signature detection and image extraction when not needed to reduce processing time and cost.

Related resources

Document Parser setup

Configure the Document Parser service, parsing engines, and deployment sizing

AI node types

Overview of all AI node types available in Agent Builder

Agent Builder overview

Get started with Agent Builder workflows

Use cases

See real-world Agent Builder workflow examples
