Available starting with FlowX.AI 5.6.0
The Web Crawler powers the web page extraction capabilities in Integration Designer workflows.
The Web Crawler is a Python-based microservice that crawls web pages using Playwright and extracts content as Markdown. It also provides AI-driven browser automation for multi-step web interactions.

Dependencies

The Web Crawler connects to standard FlowX infrastructure services that should already be configured in your environment:
  • Kafka — async job processing for crawl requests from Integration Designer
  • Identity provider (Keycloak or Azure AD) — authentication and Kafka OAuth
  • CMS Core — content management integration
  • Document Plugin — file upload for downloaded documents
  • S3-compatible storage (MinIO or AWS S3) — storing crawl results and offloaded Kafka payloads
  • AI provider (OpenAI, Anthropic, Ollama, Azure OpenAI, or GCP) — required only for browser automation features

Capabilities

What it can do

  • Single-page crawling: Extract content from individual web pages as Markdown
  • Deep crawling: Breadth-first crawling with configurable depth, concurrency, and rate limiting
  • PDF processing: Dedicated scraping strategy for linked PDF documents
  • File downloads: Automatic detection and upload of downloadable files (.docx, .xlsx, .pdf) to the Document Plugin
  • AI browser automation: Multi-step browser interactions using Claude or GPT-4o via the browser-use library
  • Session management: Persistent browser sessions for multi-step automation workflows
  • Per-level URL caps: Configurable maximum pages per depth level during deep crawls
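The deep-crawling caps above can be sketched as a breadth-first traversal over a link graph. This is a minimal illustration only: the real service renders pages with Playwright, and the `LINKS` dict stands in for links discovered on each page.

```python
# Hypothetical link graph standing in for real pages (illustrative only).
LINKS = {
    "https://a.example": ["https://b.example", "https://c.example"],
    "https://b.example": ["https://d.example"],
    "https://c.example": [],
    "https://d.example": [],
}

def deep_crawl(start, depth=2, max_pages=50, max_pages_per_level=10):
    """Breadth-first crawl honoring the total and per-level page caps."""
    visited, order = set(), []
    level = [start]
    for _ in range(depth):
        next_level = []
        for url in level[:max_pages_per_level]:  # per-level cap
            if url in visited or len(order) >= max_pages:  # total cap
                continue
            visited.add(url)
            order.append(url)  # a real crawler would extract Markdown here
            next_level.extend(LINKS.get(url, []))
        level = next_level
    return order
```

With `depth=1` only the start page is visited; each extra depth level fans out one hop further, subject to both caps.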

What it cannot do

  • Cannot crawl pages that require JavaScript-based authentication (unless using browser automation)
  • Cannot process audio or video content
  • Cannot bypass CAPTCHAs or bot protection without proxy configuration

API endpoints

All endpoints are prefixed with the URL_PREFIX value (default: /web-crawler).
| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/v1/web-crawler/parse | POST | Crawl URLs and extract Markdown content |
| /api/v1/web-crawler/interact | POST | Start AI browser automation session |
| /api/v1/web-crawler/interact/{session_id} | POST | Continue automation in existing session |
| /api/v1/web-crawler/interact/{session_id} | DELETE | Close browser session |
| /api/v1/web-crawler/info/health | GET | Health check |
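A minimal client sketch for the `/parse` endpoint, combining the default URL_PREFIX with the route from the table. The host name and bearer token are placeholders, and the request body mirrors the Kafka job payload fields; this builds the request but does not send it.

```python
import json
import urllib.request

def build_parse_request(base="http://web-crawler:8080", token="<bearer-token>"):
    """Construct (but do not send) a POST to the /parse endpoint."""
    body = {"urls": ["https://example.com"], "depth": 1}
    return urllib.request.Request(
        base + "/web-crawler/api/v1/web-crawler/parse",  # URL_PREFIX + route
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(build_parse_request())` would return the extracted Markdown, assuming the service is reachable and the token is valid.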

Configuration

Server configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| URL_PREFIX | FastAPI root path for reverse proxy routing | /web-crawler |
| VERSION | Service version identifier | DEV |
| VERBOSE | Enable verbose/debug logging (1 to enable) | 0 |
| GUNICORN_WORKERS | Number of Gunicorn worker processes | 4 |
| GUNICORN_TIMEOUT | Worker timeout in seconds | 600 |
Set GUNICORN_TIMEOUT to at least 600 (10 minutes). Deep crawls with multiple pages can take significant time depending on target site responsiveness and configured depth.

Authentication configuration

Keycloak / OAuth2

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SECURITY_OAUTH2_BASE_SERVER_URL | Keycloak server URL | (required) |
| SECURITY_OAUTH2_REALM | Default Keycloak realm | flowx |
| SECURITY_OAUTH2_CLIENT_ID | Platform auth client ID | flowx-platform-authenticate |
| SECURITY_OAUTH2_CLIENT_SECRET | Platform auth client secret | flowx-platform-authenticate-secret |
| SECURITY_PROVIDER | Auth provider (KEYCLOAK or AZURE) | KEYCLOAK |

Service account

The Web Crawler uses a service account for platform API calls (CMS, Document Plugin).
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SERVICE_OAUTH_CLIENT_ID | Service account client ID | flowx-scheduler-core-sa |
| SERVICE_OAUTH_CLIENT_SECRET | Service account client secret | - |
| SERVICE_OAUTH_CLIENT_REALM | Service account Keycloak realm | - |

Kafka configuration

The Web Crawler uses Kafka for async job processing. Integration Designer sends crawl requests, and the Web Crawler returns results through Kafka topics.

Core Kafka settings

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| KAFKA_BOOTSTRAP_SERVERS | Kafka broker address | kafka:9092 |
| KAFKA_CONSUMER_ENABLED | Enable Kafka consumer (0 to turn off) | 1 |
| KAFKA_PRODUCER_ENABLED | Enable Kafka producer (0 to turn off) | 1 |
| KAFKA_CONSUMER_GROUPID | Consumer group ID | ai-services |

Consumer tuning

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| KAFKA_MAX_POLL_INTERVAL_MS | Max poll interval (ms) | 300000 |
| KAFKA_SESSION_TIMEOUT_MS | Session timeout (ms) | 300000 |
| KAFKA_HEARTBEAT_INTERVAL_MS | Heartbeat interval (ms) | 10000 |
Increase KAFKA_MAX_POLL_INTERVAL_MS if deep crawls regularly exceed 5 minutes. The consumer will be removed from the group if processing takes longer than this interval.

OAuth authentication

When using KAFKA_SECURITY_ENABLED=1, the Web Crawler authenticates to Kafka using SASL/OAUTHBEARER via Keycloak.
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| KAFKA_SECURITY_ENABLED | Enable SASL/OAUTHBEARER auth (0 to turn off) | 1 |
| KAFKA_OAUTH_CLIENT_ID | Service account for Kafka SASL | flowx-service-client |
| KAFKA_OAUTH_CLIENT_SECRET | Service account secret | - |
| KAFKA_OAUTH_TOKEN_ENDPOINT_URI | Keycloak token endpoint for Kafka OAuth | {SECURITY_OAUTH2_BASE_SERVER_URL}/realms/kafka-authz/protocol/openid-connect/token |
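The KAFKA_OAUTH_TOKEN_ENDPOINT_URI default is derived from the base Keycloak URL. A sketch of that derivation, which follows the standard Keycloak token endpoint layout:

```python
def default_kafka_token_endpoint(base_server_url: str, realm: str = "kafka-authz") -> str:
    """Build the Keycloak token endpoint used for Kafka SASL/OAUTHBEARER."""
    return (
        f"{base_server_url.rstrip('/')}/realms/{realm}"
        "/protocol/openid-connect/token"
    )
```

Override KAFKA_OAUTH_TOKEN_ENDPOINT_URI explicitly if your Kafka authorization realm is not named `kafka-authz`.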

Topic configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| KAFKA_JOB_REQUEST_TOPIC | Incoming crawl job requests (from Integration Designer) | ai.flowx.web-crawler.job.request |
| KAFKA_JOB_RESPONSE_TOPIC | Outgoing crawl job responses (to Integration Designer) | ai.flowx.web-crawler.job.response |
The Web Crawler integrates with flx-job-lib for async job processing. On startup, the service verifies Kafka connectivity and starts a background consumer.

Kafka payload offloading

When crawl results exceed the size limit, payloads are offloaded to S3-compatible storage.
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| KAFKA_OFFLOAD_BACKEND | Storage backend (minio or azure) | minio |
| KAFKA_OFFLOAD_SIZE_LIMIT | Payload size threshold (bytes) before offloading | 512000 |
| KAFKA_OFFLOAD_ROOT_FOLDER | Root folder for offloaded payloads | payload-offload |
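The offloading rule can be sketched as: payloads whose serialized size exceeds KAFKA_OFFLOAD_SIZE_LIMIT are written to object storage and replaced by a reference on the topic. The in-memory `storage` dict and the shape of the reference message are illustrative, not the actual wire format.

```python
import json
import uuid

OFFLOAD_SIZE_LIMIT = 512_000       # bytes, mirrors KAFKA_OFFLOAD_SIZE_LIMIT
ROOT_FOLDER = "payload-offload"    # mirrors KAFKA_OFFLOAD_ROOT_FOLDER
storage = {}                       # stands in for the MinIO/S3 bucket

def maybe_offload(payload: dict) -> dict:
    """Offload the payload to storage if its serialized size exceeds the limit."""
    raw = json.dumps(payload).encode()
    if len(raw) <= OFFLOAD_SIZE_LIMIT:
        return payload  # small enough to travel inline on the topic
    key = f"{ROOT_FOLDER}/{uuid.uuid4()}.json"
    storage[key] = raw
    return {"offloaded": True, "storagePath": key}  # reference, not real schema
```

Consumers seeing the reference message fetch the full payload from storage before processing.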

Storage configuration

The Web Crawler stores crawl results in S3-compatible storage. Configure one of the following storage backends.
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| MINIO_DOCUMENTS_URL | MinIO/S3 endpoint URL | - |
| MINIO_DOCUMENTS_ACCESS_KEY | MinIO access key | - |
| MINIO_DOCUMENTS_SECRET_KEY | MinIO secret key | - |
| MINIO_DOCUMENTS_BUCKET | MinIO bucket name | - |
| MINIO_DOCUMENTS_SECURE | Use HTTPS for MinIO (1 to enable) | 0 |

Platform service URLs

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| CMS_API_URL | CMS Core service URL | http://cms:80 |
| DOCUMENT_API_URL | Document Plugin URL | http://document-plugin:80 |
| DOCUMENT_PLUGIN_API_URL | Document Plugin URL (for file uploads) | http://document-plugin:80 |
| APP_MANAGER_API_URL | App Manager service URL | http://application-manager:80 |
| CMS_STATIC_ASSETS_PATH | Static assets CDN path for CMS downloads | - |

AI model configuration

Required only for the browser automation feature (/interact endpoints). The Web Crawler uses LLM models to drive browser interactions.
| Model type | Variable | Purpose | Default |
| --- | --- | --- | --- |
| Chat | MODEL_TYPE | Primary LLM for browser automation | OPENAI |
| Vision | VISION_MODEL_TYPE | Visual page understanding | OPENAI |
| Embedding | EMBEDDING_MODEL_TYPE | Text embeddings | OPENAI |

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| FALLBACK_MODEL_TYPE | Fallback LLM provider if primary fails | - |
| STRUCTURED_OUTPUT_METHOD | Structured output method for LLM | json_schema |
All model types support the same providers: OPENAI, AZURE, OLLAMA, ANTHROPIC, GCP, CUSTOM_OPENAI.
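A sketch of the primary/fallback behavior implied by FALLBACK_MODEL_TYPE. The callables below are stand-ins for provider clients; the real service wires providers through its own model layer, so this only illustrates the retry-on-failure pattern.

```python
def call_with_fallback(prompt, primary, fallback=None):
    """Try the primary LLM provider; on failure, retry with the fallback if configured."""
    try:
        return primary(prompt)
    except Exception:
        if fallback is None:
            raise  # no FALLBACK_MODEL_TYPE configured: surface the error
        return fallback(prompt)
```

If FALLBACK_MODEL_TYPE is unset, a provider outage fails the automation step outright instead of degrading to a second provider.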

Provider-specific variables

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| OPENAI_API_KEY | OpenAI API key | - |
| OPENAI_MODEL_NAME | Chat model name | gpt-4o-2024-08-06 |
| OPENAI_VISION_MODEL_NAME | Vision model name | gpt-4o-2024-08-06 |
| OPENAI_EMBEDDING_MODEL_NAME | Embedding model name | text-embedding-3-large |
| OPENAI_CHUNK_SIZE | Embedding chunk size | 1000 |

PostgreSQL configuration

The Web Crawler uses PostgreSQL for embedding storage, memory, and chat history.
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| POSTGRES_EMBEDDING_URI | Host and port | postgresql-ai:5432 |
| POSTGRES_EMBEDDING_USERNAME | Username | dip |
| POSTGRES_EMBEDDING_PASSWORD | Password | - |
| POSTGRES_EMBEDDING_DATABASE | Database name | dip |

Observability (optional)

FlowX Observatory

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| USE_OBSERVATORY | Enable Observatory middleware (1 to enable) | 0 |
| FLOWX_OBSERVATORY_API_URL | Observatory API URL | http://observatory-api:80 |
| FLOWX_OBSERVATORY_APP_ID | Application ID for Observatory | - |
| FLOWX_OBSERVATORY_VERBOSE | Verbose Observatory logging | 0 |

Langfuse

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| LANGFUSE_PUBLIC_KEY | Langfuse public key | - |
| LANGFUSE_SECRET_KEY | Langfuse secret key | - |
| LANGFUSE_HOST | Langfuse server URL | - |

LLM Guard

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| FLOWX_LLMGUARD_URL | LLMGuard service URL | http://llmguard:8000 |

CORS configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| APPLICATION_CORS_ALLOW_ORIGIN | Allowed origins (comma-separated) | - |
| APPLICATION_CORS_ALLOW_HEADERS | Allowed headers | Accept,Content-Type,Authorization,... |
| APPLICATION_CORS_ALLOW_METHODS | Allowed methods | GET,PUT,POST,DELETE,PATCH,OPTIONS |
| APPLICATION_CORS_ALLOW_CREDENTIALS | Allow credentials | true |

MCP server configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| MCP_AUTH_KEY | Authentication key for MCP server endpoints | - |

Kafka job processing

The Web Crawler processes crawl requests asynchronously via Kafka. Integration Designer sends requests to the job request topic, and the Web Crawler writes results to S3 and sends a response to the job response topic.

Request payload

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | string[] | (required) | URLs to crawl |
| depth | int | 1 | Crawl depth (1 = single page) |
| maxPages | int | 50 | Max total pages to crawl |
| maxPagesPerLevel | int | 10 | Max pages per depth level |
| followLinks | bool | true | Follow links on crawled pages |
| processLinkedPdfs | bool | true | Process linked PDF documents |
| acceptDownloads | bool | false | Download files and upload to Document Plugin |
| pageTimeout | int | 30000 | Page navigation timeout (ms) |
| proxyUrl | string | null | Proxy URL for requests |
| applicationId | string | null | Application ID (required for file uploads) |
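A small helper that fills the documented defaults into a crawl request payload. The field names and defaults come from the table above; the helper itself is illustrative, not part of the service API.

```python
def build_crawl_request(urls, **overrides):
    """Build a crawl job payload with the documented defaults applied."""
    payload = {
        "urls": list(urls),          # required
        "depth": 1,
        "maxPages": 50,
        "maxPagesPerLevel": 10,
        "followLinks": True,
        "processLinkedPdfs": True,
        "acceptDownloads": False,
        "pageTimeout": 30000,
        "proxyUrl": None,
        "applicationId": None,       # required when acceptDownloads is true
    }
    payload.update(overrides)
    return payload
```

For example, `build_crawl_request(["https://example.com"], depth=2, acceptDownloads=True, applicationId="app-1")` produces a deep-crawl request with file downloads enabled.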

Response flow

  1. Integration Designer publishes a crawl request to ai.flowx.web-crawler.job.request
  2. Web Crawler processes the request and stores the result (Markdown content) in S3
  3. Web Crawler publishes a response with the result path to ai.flowx.web-crawler.job.response
  4. Integration Designer retrieves the result from S3
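The four steps above can be simulated end to end with in-memory stand-ins for the two topics and the S3 bucket. This is purely illustrative; message schemas and result paths are assumptions, not the real wire format.

```python
request_topic, response_topic, bucket = [], [], {}  # in-memory stand-ins

# 1. Integration Designer publishes a crawl request to the request topic
request_topic.append({"urls": ["https://example.com"], "depth": 1})

# 2. Web Crawler consumes the request and stores the Markdown result in S3
job = request_topic.pop(0)
result_path = "results/job-1.md"            # hypothetical object key
bucket[result_path] = "# Example Domain\n..."

# 3. Web Crawler publishes a response carrying the result path
response_topic.append({"resultPath": result_path})

# 4. Integration Designer reads the response and fetches the result from S3
response = response_topic.pop(0)
markdown = bucket[response["resultPath"]]
```

The key design point is that only a path travels back over Kafka; the potentially large Markdown result stays in object storage.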

Deployment and sizing

Docker

  • Base image: Python 3.13-slim-bookworm
  • Runs as non-root user (aiservices)
  • Port: 8080
  • Health check: /web-crawler/api/v1/web-crawler/info/health

Kubernetes configuration

| Setting | Value |
| --- | --- |
| Replicas | 2 |
| CPU requests | 1 core |
| CPU limits | 2 cores |
| RAM requests | 1 Gi |
| RAM limits | 2 Gi |
| GUNICORN_WORKERS | 4 |
The Web Crawler runs headless Chromium browsers for page rendering. Each concurrent crawl consumes significant memory. Monitor memory usage and increase limits if pods are OOM-killed during deep crawls.

Verify your setup

  • The Web Crawler pod is running: kubectl get pods -l app=web-crawler
  • The health endpoint returns HTTP 200: curl http://web-crawler:8080/web-crawler/api/v1/web-crawler/info/health
  • The Kafka consumer is connected — check pod logs for a "Kafka consumer started" message at startup
  • Integration Designer can reach the Web Crawler — verify FLOWX_WEBCRAWLER_BASEURL is set in the Integration Designer setup

Troubleshooting

Symptoms: Pod stays in CrashLoopBackOff or never becomes ready.
Solutions:
  • Verify Chromium/Playwright dependencies are available (included in Docker image)
  • Check that Kafka broker addresses are reachable
  • Ensure the Keycloak server URL is correct and accessible
  • Review pod logs for specific startup errors
Symptoms: Integration Designer sends crawl requests but no results are returned.
Solutions:
  • Verify KAFKA_CONSUMER_ENABLED is set to 1
  • Check that KAFKA_JOB_REQUEST_TOPIC matches the topic Integration Designer publishes to
  • Ensure the Kafka consumer group (KAFKA_CONSUMER_GROUPID) has no conflicting consumers
  • Review pod logs for Kafka connection or authentication errors
Symptoms: Pages return no content or partial Markdown.
Solutions:
  • Increase pageTimeout in the crawl request for slow-loading pages
  • Check if the target site blocks headless browsers (try configuring a proxyUrl)
  • Verify network connectivity from the pod to the target URLs
  • Review verbose logs (VERBOSE=1) for page-level errors
Symptoms: acceptDownloads is enabled but files are not appearing in the Document Plugin.
Solutions:
  • Verify DOCUMENT_PLUGIN_API_URL is set and reachable
  • Ensure applicationId is provided in the crawl request payload
  • Check service account credentials (SERVICE_OAUTH_CLIENT_ID / SERVICE_OAUTH_CLIENT_SECRET)
  • Review pod logs for upload errors or authentication failures
Symptoms: Pods restart with OOMKilled status during multi-page crawls.
Solutions:
  • Increase memory limits (try 4 Gi)
  • Reduce maxPages and maxPagesPerLevel in crawl requests
  • Reduce GUNICORN_WORKERS to limit concurrent browser instances
  • Scale horizontally with more pods instead of larger pods

Related resources

  • Integration Designer setup — configure Integration Designer, which orchestrates Web Crawler jobs
  • Integration Designer — learn about building integration workflows
  • Kafka Authentication — configure Kafka security and authentication
  • Document Plugin setup — configure the Document Plugin for file storage
Last modified on April 9, 2026