Available starting with FlowX.AI 5.6.0

Overview

The Web Page Extractor node is a workflow node that collects readable content from web page URLs. It supports static URL lists and dynamic URL generation, configurable crawling depth with link following, and adjustable scrape speed presets.
[Image: Web Page Extractor node configuration with URLs, Crawl Depth, and Scrape Speed settings]

Static or dynamic URLs

Provide a fixed list of URLs or generate them dynamically from workflow data

Link following

Optionally follow links on pages up to a configurable depth

PDF processing

Extract content from PDF files linked on the page

Scrape speed control

Choose from speed presets or define custom rate limits and concurrency

Configuration

1. Open your workflow: open your workflow in Integration Designer.
2. Add the node: add a Web Page Extractor node from the Tools category in the left panel.
3. Configure URL source and extraction settings: configure the settings described below.

URL source

URL Mode
enum
required
How URLs are provided to the node.
Mode      Description
Static    Provide a fixed list of up to 20 URLs
Dynamic   Generate URLs from a workflow data key using ${expression} syntax

Default: Static
URLs
string[]
List of URLs to extract content from. Only available when URL Mode is Static.
Maximum: 20 URLs
URLs must use http:// or https:// protocol. Supports ${variable} placeholders for dynamic values.

A workflow data key or expression that resolves to a URL at runtime. Only available when URL Mode is Dynamic.
Example: ${inputData.targetUrl}
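To make the ${expression} semantics concrete, here is a minimal Python sketch of how a placeholder like ${inputData.targetUrl} could resolve against workflow data. The actual FlowX.AI resolution engine is internal and may behave differently; this only illustrates the dotted-path lookup described above.

```python
import re

def resolve_placeholders(template: str, workflow_data: dict) -> str:
    """Replace ${dotted.path} placeholders with values from workflow data.

    Illustrative only -- a sketch of the ${expression} syntax, not the
    FlowX.AI implementation.
    """
    def lookup(match: re.Match) -> str:
        value = workflow_data
        for part in match.group(1).split("."):
            value = value[part]  # raises KeyError if the path is missing
        return str(value)

    return re.sub(r"\$\{([^}]+)\}", lookup, template)

# Example from the docs: ${inputData.targetUrl} resolved at runtime.
url = resolve_placeholders(
    "${inputData.targetUrl}",
    {"inputData": {"targetUrl": "https://example.com/docs"}},
)
```

Placeholders can also be embedded in a longer string (as in the Static mode's ${variable} support), since the substitution is per-match.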

Crawl depth

Follow Links
boolean
When turned on, the extractor follows links found on the page up to the configured depth.
Default: OFF
Max Depth
number
How many levels of links to follow from the starting page. Only applies when Follow Links is turned on.
Range: 0–10
Default: 0
Process Linked PDFs
boolean
When turned on, extracts content from PDF files linked on the page.
Default: OFF
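Depth-limited link following behaves like a breadth-first traversal: depth 0 visits only the starting pages, depth 1 also visits pages they link to, and so on. The sketch below illustrates that semantics with a stubbed fetch_links function; real extraction, PDF handling, and URL normalization are left out.

```python
from collections import deque

def crawl(start_urls, fetch_links, max_depth=0):
    """Depth-limited breadth-first link following.

    `fetch_links(url)` stands in for fetching a page and returning the
    links found on it. max_depth=0 visits only the starting pages,
    matching the Max Depth default described above.
    """
    visited = set()
    queue = deque((url, 0) for url in start_urls)
    order = []
    while queue:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        if depth < max_depth:
            for link in fetch_links(url):
                if link not in visited:
                    queue.append((link, depth + 1))
    return order

# Stub link graph standing in for real pages (hypothetical URLs).
links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
visited_pages = crawl(["a"], lambda u: links.get(u, []), max_depth=1)
```

Note how quickly the visited set grows with depth, which is why the best practices below recommend keeping Max Depth low.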

Scrape speed

Scrape Speed Preset
enum
required
Controls how aggressively the node requests pages from the target server.
Preset     Description
Slow       Conservative rate limiting; best for fragile or rate-limited servers
Moderate   Balanced speed and reliability
Fast       Aggressive crawling; assumes the target server can handle high traffic
Custom     Define your own rate limit and concurrency

Default: Moderate
Rate Limit
number
Maximum requests per second. Only available when Scrape Speed Preset is Custom.
Default: 2
Concurrency
number
Number of concurrent requests. Only available when Scrape Speed Preset is Custom.
Default: 3
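Rate Limit and Concurrency control two different things: how many requests may start per second, and how many may be in flight at once. A minimal asyncio sketch of that interaction, with a hypothetical fake_fetch standing in for real HTTP requests (this is not the FlowX.AI scheduler, just an illustration of the Custom preset's two knobs):

```python
import asyncio

async def fetch_all(urls, fetch, rate_limit=2, concurrency=3):
    """Fetch URLs with at most `concurrency` requests in flight and at
    most `rate_limit` new requests started per second.

    Defaults mirror the Custom preset defaults documented above.
    """
    sem = asyncio.Semaphore(concurrency)   # caps in-flight requests
    interval = 1.0 / rate_limit            # spacing between request starts
    results = {}

    async def worker(url, start_delay):
        await asyncio.sleep(start_delay)   # pace request starts
        async with sem:
            results[url] = await fetch(url)

    await asyncio.gather(*(worker(u, i * interval) for i, u in enumerate(urls)))
    return results

async def fake_fetch(url):
    """Stand-in for a real HTTP fetch (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"content of {url}"

results = asyncio.run(
    fetch_all(["https://example.com/a", "https://example.com/b"],
              fake_fetch, rate_limit=10, concurrency=3)
)
```

With the Slow preset the effective interval between request starts is longer; with Fast, shorter. The semaphore is what prevents a burst of slow responses from piling up open connections.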

Response key

responseKey
string
required
The key where extracted content is stored in the workflow data.
Example: extractedContent

Timeout and retry

Timeout
number
Request timeout in milliseconds. If the extraction exceeds this duration, the node fails.
Retry Config
object
Optional retry strategy for failed requests.
Field                Description                                   Default
Retry Type           Fixed or Exponential backoff                  –
Max Attempts         Maximum retry attempts                        2
Backoff Period       Delay between retries (ms)                    1000
Max Backoff Period   Maximum delay for exponential backoff (ms)    120000
Backoff Multiplier   Multiplier for exponential backoff            2
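The retry fields above combine as follows: each retry waits the backoff period, which under exponential backoff is multiplied after every attempt and capped at the maximum. A small sketch of that arithmetic, using the documented defaults (this models the delay schedule only, not the node's actual retry machinery):

```python
def backoff_delays(max_attempts=2, backoff_ms=1000, multiplier=2,
                   max_backoff_ms=120000, exponential=True):
    """Delay in ms before each retry attempt.

    Defaults match the Retry Config table above; `exponential=False`
    models the Fixed retry type (a constant delay every attempt).
    """
    delays = []
    delay = backoff_ms
    for _ in range(max_attempts):
        delays.append(min(delay, max_backoff_ms))  # cap at Max Backoff Period
        if exponential:
            delay *= multiplier                    # apply Backoff Multiplier
    return delays
```

For example, four exponential attempts with the defaults wait 1000, 2000, 4000, then 8000 ms, while long schedules flatten out once they hit the 120000 ms cap.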

Best practices

Start with Moderate speed

Use the Moderate preset unless you know the target server’s capacity. Switch to Fast only for internal or robust servers.

Limit crawl depth

Keep Max Depth low (1–3) to avoid excessive page requests. Deep crawls can be slow and may trigger rate limiting.

Use dynamic URLs for runtime flexibility

When the target URL comes from user input or a previous workflow step, use Dynamic mode with ${expression} placeholders.

Set timeouts for external sites

Always configure a timeout when crawling external websites to avoid blocking the workflow on slow or unresponsive servers.

Extract Data from File

Extract text and data from documents and images

AI node types

Overview of all AI workflow node types

Integration Designer

Build and manage integration workflows
Last modified on March 25, 2026