# Web Page Extractor

> Collect readable content from web page URLs with configurable crawling depth, link following, and scrape speed.

## Overview

The **Web Page Extractor** is a workflow node that collects readable content from web page URLs. It accepts either a static URL list or a URL resolved dynamically at runtime, can follow links up to a configurable crawl depth, and offers scrape speed presets as well as custom rate limits.

<Frame>
  ![Web Page Extractor node configuration with URLs, Crawl Depth, and Scrape Speed settings](https://s3.eu-west-1.amazonaws.com/docx.flowx.ai/5.6/web_page_extractor_node.png)
</Frame>

<CardGroup cols={2}>
  <Card title="Static or dynamic URLs" icon="link">
    Provide a fixed list of URLs or generate them dynamically from workflow data
  </Card>

  <Card title="Link following" icon="diagram-project">
    Optionally follow links on pages up to a configurable depth
  </Card>

  <Card title="PDF processing" icon="file-pdf">
    Extract content from PDF files linked on the page
  </Card>

  <Card title="Scrape speed control" icon="gauge-high">
    Choose from speed presets or define custom rate limits and concurrency
  </Card>

  <Card title="File downloads" icon="download">
    Download attached files (.docx, .xlsx, .pdf) found during crawling and store them automatically
  </Card>
</CardGroup>

***

## Configuration

<Steps>
  <Step title="Open your workflow">
    Open your workflow in **Integration Designer**.
  </Step>

  <Step title="Add the node">
    Add a **Web Page Extractor** node from the **Tools** category in the left panel.
  </Step>

  <Step title="Configure URL source and extraction settings">
    Set the URL source, crawl depth, file download, scrape speed, response key, and timeout options described in the sections below.
  </Step>
</Steps>

***

### URL source

<ParamField path="URL Mode" type="enum" required>
  How URLs are provided to the node.

  | Mode        | Description                                                         |
  | ----------- | ------------------------------------------------------------------- |
  | **Static**  | Provide a fixed list of up to 20 URLs                               |
  | **Dynamic** | Generate URLs from a workflow data key using `${expression}` syntax |

  **Default:** `Static`
</ParamField>

<ParamField path="URLs" type="string[]">
  List of URLs to extract content from. Only available when **URL Mode** is `Static`.

  **Maximum:** 20 URLs

  Each URL must use the `http://` or `https://` scheme. Entries can contain `${variable}` placeholders for values that are resolved at runtime.
</ParamField>
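
The two constraints on static URLs (at most 20 entries, `http://` or `https://` only) can be checked before a list is pasted into the node. The following is a minimal sketch in Python, not part of FlowX: `validate_static_urls` is a hypothetical helper, and it assumes that entries containing a `${...}` placeholder can only be checked after they are resolved.

```python
from urllib.parse import urlparse

MAX_STATIC_URLS = 20  # limit stated on this page
ALLOWED_SCHEMES = {"http", "https"}


def validate_static_urls(urls: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the input passes."""
    problems = []
    if len(urls) > MAX_STATIC_URLS:
        problems.append(f"{len(urls)} URLs given, maximum is {MAX_STATIC_URLS}")
    for url in urls:
        if "${" in url:
            # Assumption: placeholders are filled in at runtime, so skip the check.
            continue
        parsed = urlparse(url)
        if parsed.scheme not in ALLOWED_SCHEMES or not parsed.netloc:
            problems.append(f"not an http(s) URL: {url}")
    return problems
```

For example, `validate_static_urls(["ftp://example.com/x"])` returns one problem, while a list of valid `https://` URLs returns an empty list.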

<ParamField path="Dynamic Link" type="string">
  A workflow data key or expression that resolves to a URL at runtime. Only available when **URL Mode** is `Dynamic`.

  **Example:** `${inputData.targetUrl}`
</ParamField>
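
To picture how an expression such as `${inputData.targetUrl}` turns into a URL, the sketch below resolves dotted paths against a dictionary of workflow data. It is an illustration only: the `resolve` function and the sample data are hypothetical, and the actual expression syntax supported by the platform may be richer than plain dotted paths.

```python
import re

# Matches ${...} and captures the text between the braces.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve(template: str, data: dict) -> str:
    """Replace each ${dotted.path} in `template` with the value found in `data`."""

    def lookup(match: re.Match) -> str:
        value = data
        for part in match.group(1).strip().split("."):
            value = value[part]  # raises KeyError if the path is missing
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


# Hypothetical workflow data, for illustration only.
workflow_data = {"inputData": {"targetUrl": "https://example.com/docs"}}
print(resolve("${inputData.targetUrl}", workflow_data))
# prints: https://example.com/docs
```

A missing key raises `KeyError` in this sketch, which makes a misconfigured expression visible early instead of producing an empty URL.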

***

### Crawl depth

<ParamField path="Follow Links" type="boolean">
  When turned on, the extractor follows links found on the page up to the configured depth.

  **Default:** OFF
</ParamField>

<ParamField path="Depth of Crawling" type="number">
  How many levels of links to follow from the starting page. Only available when **Follow Links** is turned on. Displayed in the panel as a slider with the current value shown to the right (for example, *2 levels*).

  **Range:** 0–10

  **Default:** `0`
</ParamField>

<ParamField path="Crawl URLs Containing" type="string[]">
  Optional list of substring filters applied to discovered links. Only URLs that contain at least one of the configured fragments are followed. Click **Set Filters** to open the filter editor. Only available when **Follow Links** is turned on.
</ParamField>

<ParamField path="Process Linked PDFs" type="boolean">
  When turned on, extracts content from PDF files linked on the page. Only available when **Follow Links** is turned on.

  **Default:** OFF
</ParamField>
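
The interaction between **Depth of Crawling** and **Crawl URLs Containing** can be pictured as a breadth-first walk over the link graph. The sketch below runs on an in-memory dictionary of links rather than on real pages. It assumes that depth `0` means only the starting page is read and that the substring filters apply to discovered links but not to the starting URL; the node's exact behavior may differ. `plan_crawl` and the sample URLs are hypothetical.

```python
from collections import deque


def plan_crawl(start, links, max_depth, contains=()):
    """Breadth-first walk over `links` (page -> outgoing URLs); returns visit order."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the configured depth
        for url in links.get(page, []):
            if url in seen:
                continue  # already scheduled
            if contains and not any(fragment in url for fragment in contains):
                continue  # no configured fragment matches this link
            seen.add(url)
            visited.append(url)
            queue.append((url, depth + 1))
    return visited


# Hypothetical site structure, for illustration only.
site = {
    "https://example.com/": ["https://example.com/docs/a", "https://example.com/blog/x"],
    "https://example.com/docs/a": ["https://example.com/docs/b"],
    "https://example.com/docs/b": ["https://example.com/docs/c"],
}
print(plan_crawl("https://example.com/", site, max_depth=2, contains=["/docs/"]))
# prints: ['https://example.com/', 'https://example.com/docs/a', 'https://example.com/docs/b']
```

With `max_depth=2`, the page `/docs/c` is never reached because it sits three levels below the start, and `/blog/x` is skipped because it contains none of the configured fragments.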

***

### Download attached files

These settings control whether files found during crawling (`.docx`, `.xlsx`, `.pdf`) are downloaded and where they are stored.

<ParamField path="Download Attached Files" type="boolean">
  When turned on, files discovered during crawling are downloaded and stored automatically. Supported file types include `.docx`, `.xlsx`, and `.pdf`.

  **Default:** OFF
</ParamField>

<ParamField path="Document Destination" type="enum">
  Where downloaded files are stored. Only available when **Download Attached Files** is turned on.

  | Option              | Description                                                              |
  | ------------------- | ------------------------------------------------------------------------ |
  | **Document Plugin** | Store files through the FlowX Document Plugin. Requires **Folder Name**. |
  | **S3 Protocol**     | Store files directly using S3-compatible storage.                        |
</ParamField>

<ParamField path="Folder Name" type="string" required>
  Identifier used to associate the file with its business owner. Only available when **Download Attached Files** is turned on and **Document Destination** is `Document Plugin`.
</ParamField>

***

### Scrape speed

<ParamField path="Scrape Speed" type="enum" required>
  Controls how aggressively the node requests pages from the target server.

  | Preset       | Description                                                             |
  | ------------ | ----------------------------------------------------------------------- |
  | **Slow**     | Conservative rate limiting — best for fragile or rate-limited servers   |
  | **Moderate** | Balanced speed and reliability                                          |
  | **Fast**     | Aggressive crawling — assumes the target server can handle high traffic |
  | **Custom**   | Define your own rate limit and concurrency                              |

  **Default:** `Moderate`
</ParamField>

<ParamField path="Rate Limit" type="number">
  Maximum requests per second. Displayed in the panel as a slider labelled in `req/s`. Only available when **Scrape Speed** is `Custom`.

  **Default:** `2 req/s`
</ParamField>

<ParamField path="Concurrency" type="number">
  Maximum number of requests in flight at the same time. Displayed in the panel as a slider labelled `parallel`. Only available when **Scrape Speed** is `Custom`.

  **Default:** `3 parallel`
</ParamField>
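
**Rate Limit** and **Concurrency** constrain a crawl in different ways: the first spaces out when requests start, the second caps how many are open at once. The following `asyncio` sketch illustrates the difference. It is not the node's implementation: `fetch_all` is a hypothetical helper, and `fetch` stands for any coroutine function that downloads one URL.

```python
import asyncio
import time


async def fetch_all(urls, fetch, rate_limit=2.0, concurrency=3):
    """Call `fetch(url)` for every URL, spacing request starts by
    1/rate_limit seconds and keeping at most `concurrency` calls in flight."""
    interval = 1.0 / rate_limit
    slots = asyncio.Semaphore(concurrency)  # caps parallel requests
    start_lock = asyncio.Lock()             # serializes the start-time bookkeeping
    next_start = time.monotonic()

    async def worker(url):
        nonlocal next_start
        async with slots:
            async with start_lock:
                delay = next_start - time.monotonic()
                if delay > 0:
                    await asyncio.sleep(delay)  # wait for the next allowed start
                next_start = max(next_start, time.monotonic()) + interval
            return await fetch(url)

    # gather() returns results in the same order as `urls`.
    return await asyncio.gather(*(worker(u) for u in urls))
```

With the custom defaults listed above (`2 req/s`, `3 parallel`), ten URLs would take at least about 4.5 seconds to start in this model, no matter how quickly the server answers.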

***

### Response key

<ParamField path="responseKey" type="string" required>
  The key where extracted content is stored in the workflow data.

  **Example:** `extractedContent`
</ParamField>

***

### Timeout and retry

<ParamField path="Timeout" type="number">
  Request timeout in milliseconds. If the extraction exceeds this duration, the node fails.
</ParamField>

<ParamField path="Retry Config" type="object">
  Optional retry strategy for failed requests.

  | Field                  | Description                                | Default  |
  | ---------------------- | ------------------------------------------ | -------- |
  | **Retry Type**         | `Fixed` or `Exponential` backoff           | —        |
  | **Max Attempts**       | Maximum retry attempts                     | `2`      |
  | **Backoff Period**     | Delay between retries (ms)                 | `1000`   |
  | **Max Backoff Period** | Maximum delay for exponential backoff (ms) | `120000` |
  | **Backoff Multiplier** | Multiplier for exponential backoff         | `2`      |
</ParamField>
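
To see how the retry fields combine, the sketch below computes the wait before each retry using the defaults from the table. It assumes that **Max Attempts** counts retries, that the first retry waits exactly **Backoff Period**, and that **Max Backoff Period** caps only exponential delays; the platform's exact semantics may differ. `retry_delays` is a hypothetical helper.

```python
def retry_delays(retry_type, max_attempts=2, backoff_ms=1000,
                 max_backoff_ms=120_000, multiplier=2):
    """Return the wait in milliseconds before each retry attempt."""
    delays = []
    for attempt in range(max_attempts):
        if retry_type == "Exponential":
            # Grow the delay geometrically, then cap it at the maximum.
            delay = min(backoff_ms * multiplier ** attempt, max_backoff_ms)
        else:
            # "Fixed": the same delay before every retry.
            delay = backoff_ms
        delays.append(delay)
    return delays


print(retry_delays("Fixed", max_attempts=3))
# prints: [1000, 1000, 1000]
print(retry_delays("Exponential", max_attempts=5))
# prints: [1000, 2000, 4000, 8000, 16000]
```

Under these assumptions the cap first applies at the eighth retry, where the uncapped delay of 128000 ms is reduced to 120000 ms.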

***

## Best practices

<CardGroup cols={2}>
  <Card title="Start with Moderate speed" icon="gauge">
    Use the Moderate preset unless you know the target server's capacity. Switch to Fast only for internal or robust servers.
  </Card>

  <Card title="Limit crawl depth" icon="layer-group">
    Keep **Depth of Crawling** low (1–3) to avoid excessive page requests. Each extra level can multiply the number of pages requested, so deep crawls are slow and may trigger rate limiting on the target server.
  </Card>

  <Card title="Use dynamic URLs for runtime flexibility" icon="code">
    When the target URL comes from user input or a previous workflow step, use Dynamic mode with `${expression}` placeholders.
  </Card>

  <Card title="Set timeouts for external sites" icon="clock">
    Always configure a timeout when crawling external websites to avoid blocking the workflow on slow or unresponsive servers.
  </Card>
</CardGroup>

***

## Related resources

<CardGroup cols={2}>
  <Card title="Extract Data from File" icon="file-lines" href="./extract-data-from-file">
    Extract text and data from documents and images
  </Card>

  <Card title="AI node types" icon="diagram-project" href="./node-types">
    Overview of all AI workflow node types
  </Card>

  <Card title="Integration Designer" icon="sitemap" href="/5.9/docs/platform-deep-dive/integrations/integration-designer">
    Build and manage integration workflows
  </Card>
</CardGroup>
