Available starting with FlowX.AI 5.6.0
The Web Crawler powers the web page extraction capabilities in Integration Designer workflows.
The Web Crawler is a Python-based microservice that crawls web pages using Playwright and extracts content as Markdown. It also provides AI-driven browser automation for multi-step web interactions.

Dependencies

The Web Crawler connects to standard FlowX infrastructure services that should already be configured in your environment:
  • Kafka — async job processing for crawl requests from Integration Designer
  • Identity provider (Keycloak or Azure AD) — authentication and Kafka OAuth
  • CMS Core — content management integration
  • Document Plugin — file upload for downloaded documents
  • S3-compatible storage (MinIO or AWS S3) — storing crawl results and offloaded Kafka payloads
  • AI provider (OpenAI, Anthropic, Ollama, Azure OpenAI, or GCP) — required only for browser automation features

Capabilities

What it can do

  • Single-page crawling: Extract content from individual web pages as Markdown
  • Deep crawling: Breadth-first crawling with configurable depth, concurrency, and rate limiting
  • PDF processing: Dedicated scraping strategy for linked PDF documents
  • File downloads: Automatic detection and upload of downloadable files (.docx, .xlsx, .pdf) to the Document Plugin
  • AI browser automation: Multi-step browser interactions using Claude or GPT-4o via the browser-use library
  • Session management: Persistent browser sessions for multi-step automation workflows
  • Per-level URL caps: Configurable maximum pages per depth level during deep crawls
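The deep-crawling caps above can be sketched as a breadth-first traversal over a link graph. This is a minimal illustration only: the real service renders pages with Playwright, and the `LINKS` dict stands in for links discovered on each page.

```python
# Hypothetical link graph standing in for real pages (illustrative only).
LINKS = {
    "https://a.example": ["https://b.example", "https://c.example"],
    "https://b.example": ["https://d.example"],
    "https://c.example": [],
    "https://d.example": [],
}

def deep_crawl(start, depth=2, max_pages=50, max_pages_per_level=10):
    """Breadth-first crawl honoring the total and per-level page caps."""
    visited, order = set(), []
    level = [start]
    for _ in range(depth):
        next_level = []
        for url in level[:max_pages_per_level]:  # per-level cap
            if url in visited or len(order) >= max_pages:  # total cap
                continue
            visited.add(url)
            order.append(url)  # a real crawler would extract Markdown here
            next_level.extend(LINKS.get(url, []))
        level = next_level
    return order
```

With `depth=1` only the start page is visited; each extra depth level fans out one hop further, subject to both caps.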

What it cannot do

  • Cannot crawl pages that require JavaScript-based authentication (unless using browser automation)
  • Cannot process audio or video content
  • Cannot bypass CAPTCHAs or bot protection without proxy configuration

API endpoints

All endpoints are prefixed with the URL_PREFIX value (default: /web-crawler).
| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/v1/web-crawler/parse | POST | Crawl URLs and extract Markdown content |
| /api/v1/web-crawler/interact | POST | Start AI browser automation session |
| /api/v1/web-crawler/interact/{session_id} | POST | Continue automation in existing session |
| /api/v1/web-crawler/interact/{session_id} | DELETE | Close browser session |
| /api/v1/web-crawler/info/health | GET | Health check |
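A minimal client sketch for the `/parse` endpoint, combining the default URL_PREFIX with the route from the table. The host name and bearer token are placeholders, and the request body mirrors the Kafka job payload fields; this builds the request but does not send it.

```python
import json
import urllib.request

def build_parse_request(base="http://web-crawler:8080", token="<bearer-token>"):
    """Construct (but do not send) a POST to the /parse endpoint."""
    body = {"urls": ["https://example.com"], "depth": 1}
    return urllib.request.Request(
        base + "/web-crawler/api/v1/web-crawler/parse",  # URL_PREFIX + route
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(build_parse_request())` would return the extracted Markdown, assuming the service is reachable and the token is valid.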

Configuration

Server configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| URL_PREFIX | FastAPI root path for reverse proxy routing | /web-crawler |
| VERSION | Service version identifier | DEV |
| VERBOSE | Enable verbose/debug logging (1 to enable) | 0 |
| GUNICORN_WORKERS | Number of Gunicorn worker processes | 4 |
| GUNICORN_TIMEOUT | Worker timeout in seconds | 600 |
Set GUNICORN_TIMEOUT to at least 600 (10 minutes). Deep crawls with multiple pages can take significant time depending on target site responsiveness and configured depth.

Authentication configuration

Keycloak / OAuth2

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SECURITY_OAUTH2_BASE_SERVER_URL | Keycloak server URL | (required) |
| SECURITY_OAUTH2_REALM | Default Keycloak realm | flowx |
| SECURITY_OAUTH2_CLIENT_ID | Platform auth client ID | flowx-platform-authenticate |
| SECURITY_OAUTH2_CLIENT_SECRET | Platform auth client secret | flowx-platform-authenticate-secret |
| SECURITY_PROVIDER | Auth provider (KEYCLOAK or AZURE) | KEYCLOAK |

Service account

The Web Crawler uses a service account for platform API calls (CMS, Document Plugin).
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| SERVICE_OAUTH_CLIENT_ID | Service account client ID | flowx-scheduler-core-sa |
| SERVICE_OAUTH_CLIENT_SECRET | Service account client secret | - |
| SERVICE_OAUTH_CLIENT_REALM | Service account Keycloak realm | - |

Kafka configuration

The Web Crawler uses Kafka for async job processing. Integration Designer sends crawl requests, and the Web Crawler returns results through Kafka topics.

Core Kafka settings

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| KAFKA_BOOTSTRAP_SERVERS | Kafka broker address | kafka:9092 |
| KAFKA_CONSUMER_ENABLED | Enable Kafka consumer (0 to turn off) | 1 |
| KAFKA_PRODUCER_ENABLED | Enable Kafka producer (0 to turn off) | 1 |
| KAFKA_CONSUMER_GROUPID | Consumer group ID | ai-services |

Consumer tuning

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| KAFKA_MAX_POLL_INTERVAL_MS | Max poll interval (ms) | 300000 |
| KAFKA_SESSION_TIMEOUT_MS | Session timeout (ms) | 300000 |
| KAFKA_HEARTBEAT_INTERVAL_MS | Heartbeat interval (ms) | 10000 |
Increase KAFKA_MAX_POLL_INTERVAL_MS if deep crawls regularly exceed 5 minutes. The consumer will be removed from the group if processing takes longer than this interval.

OAuth authentication

When using KAFKA_SECURITY_ENABLED=1, the Web Crawler authenticates to Kafka using SASL/OAUTHBEARER via Keycloak.
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| KAFKA_SECURITY_ENABLED | Enable SASL/OAUTHBEARER auth (0 to turn off) | 1 |
| KAFKA_OAUTH_CLIENT_ID | Service account for Kafka SASL | flowx-service-client |
| KAFKA_OAUTH_CLIENT_SECRET | Service account secret | - |
| KAFKA_OAUTH_TOKEN_ENDPOINT_URI | Keycloak token endpoint for Kafka OAuth | {SECURITY_OAUTH2_BASE_SERVER_URL}/realms/kafka-authz/protocol/openid-connect/token |
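The KAFKA_OAUTH_TOKEN_ENDPOINT_URI default is derived from the base Keycloak URL. A sketch of that derivation, which follows the standard Keycloak token endpoint layout:

```python
def default_kafka_token_endpoint(base_server_url: str, realm: str = "kafka-authz") -> str:
    """Build the Keycloak token endpoint used for Kafka SASL/OAUTHBEARER."""
    return (
        f"{base_server_url.rstrip('/')}/realms/{realm}"
        "/protocol/openid-connect/token"
    )
```

Override KAFKA_OAUTH_TOKEN_ENDPOINT_URI explicitly if your Kafka authorization realm is not named `kafka-authz`.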

Topic configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| KAFKA_JOB_REQUEST_TOPIC | Incoming crawl job requests (from Integration Designer) | ai.flowx.web-crawler.job.request |
| KAFKA_JOB_RESPONSE_TOPIC | Outgoing crawl job responses (to Integration Designer) | ai.flowx.web-crawler.job.response |
The Web Crawler integrates with flx-job-lib for async job processing. On startup, the service verifies Kafka connectivity and starts a background consumer.

Kafka payload offloading

When crawl results exceed the size limit, payloads are offloaded to S3-compatible storage.
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| KAFKA_OFFLOAD_BACKEND | Storage backend (minio or azure) | minio |
| KAFKA_OFFLOAD_SIZE_LIMIT | Payload size threshold (bytes) before offloading | 512000 |
| KAFKA_OFFLOAD_ROOT_FOLDER | Root folder for offloaded payloads | payload-offload |
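The offloading rule can be sketched as: payloads whose serialized size exceeds KAFKA_OFFLOAD_SIZE_LIMIT are written to object storage and replaced by a reference on the topic. The in-memory `storage` dict and the shape of the reference message are illustrative, not the actual wire format.

```python
import json
import uuid

OFFLOAD_SIZE_LIMIT = 512_000       # bytes, mirrors KAFKA_OFFLOAD_SIZE_LIMIT
ROOT_FOLDER = "payload-offload"    # mirrors KAFKA_OFFLOAD_ROOT_FOLDER
storage = {}                       # stands in for the MinIO/S3 bucket

def maybe_offload(payload: dict) -> dict:
    """Offload the payload to storage if its serialized size exceeds the limit."""
    raw = json.dumps(payload).encode()
    if len(raw) <= OFFLOAD_SIZE_LIMIT:
        return payload  # small enough to travel inline on the topic
    key = f"{ROOT_FOLDER}/{uuid.uuid4()}.json"
    storage[key] = raw
    return {"offloaded": True, "storagePath": key}  # reference, not real schema
```

Consumers seeing the reference message fetch the full payload from storage before processing.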

Storage configuration

The Web Crawler stores crawl results in S3-compatible storage. Configure one of the following storage backends.
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| MINIO_DOCUMENTS_URL | MinIO/S3 endpoint URL | - |
| MINIO_DOCUMENTS_ACCESS_KEY | MinIO access key | - |
| MINIO_DOCUMENTS_SECRET_KEY | MinIO secret key | - |
| MINIO_DOCUMENTS_BUCKET | MinIO bucket name | - |
| MINIO_DOCUMENTS_SECURE | Use HTTPS for MinIO (1 to enable) | 0 |

Platform service URLs

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| CMS_API_URL | CMS Core service URL | http://cms:80 |
| DOCUMENT_API_URL | Document Plugin URL | http://document-plugin:80 |
| DOCUMENT_PLUGIN_API_URL | Document Plugin URL (for file uploads) | http://document-plugin:80 |
| APP_MANAGER_API_URL | App Manager service URL | http://application-manager:80 |
| CMS_STATIC_ASSETS_PATH | Static assets CDN path for CMS downloads | - |

AI model configuration

Required only for the browser automation feature (/interact endpoints). The Web Crawler uses LLM models to drive browser interactions.
| Model type | Variable | Purpose | Default |
| --- | --- | --- | --- |
| Chat | MODEL_TYPE | Primary LLM for browser automation | OPENAI |
| Vision | VISION_MODEL_TYPE | Visual page understanding | OPENAI |
| Embedding | EMBEDDING_MODEL_TYPE | Text embeddings | OPENAI |

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| FALLBACK_MODEL_TYPE | Fallback LLM provider if primary fails | - |
| STRUCTURED_OUTPUT_METHOD | Structured output method for LLM | json_schema |
All model types support the same providers: OPENAI, AZURE, OLLAMA, ANTHROPIC, GCP, CUSTOM_OPENAI.
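A sketch of the primary/fallback behavior implied by FALLBACK_MODEL_TYPE. The callables below are stand-ins for provider clients; the real service wires providers through its own model layer, so this only illustrates the retry-on-failure pattern.

```python
def call_with_fallback(prompt, primary, fallback=None):
    """Try the primary LLM provider; on failure, retry with the fallback if configured."""
    try:
        return primary(prompt)
    except Exception:
        if fallback is None:
            raise  # no FALLBACK_MODEL_TYPE configured: surface the error
        return fallback(prompt)
```

If FALLBACK_MODEL_TYPE is unset, a provider outage fails the automation step outright instead of degrading to a second provider.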

Provider-specific variables

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| OPENAI_API_KEY | OpenAI API key | - |
| OPENAI_MODEL_NAME | Chat model name | gpt-4o-2024-08-06 |
| OPENAI_VISION_MODEL_NAME | Vision model name | gpt-4o-2024-08-06 |
| OPENAI_EMBEDDING_MODEL_NAME | Embedding model name | text-embedding-3-large |
| OPENAI_CHUNK_SIZE | Embedding chunk size | 1000 |

PostgreSQL configuration

The Web Crawler uses PostgreSQL for embedding storage, memory, and chat history.
| Environment Variable | Description | Default Value |
| --- | --- | --- |
| POSTGRES_EMBEDDING_URI | Host and port | postgresql-ai:5432 |
| POSTGRES_EMBEDDING_USERNAME | Username | dip |
| POSTGRES_EMBEDDING_PASSWORD | Password | - |
| POSTGRES_EMBEDDING_DATABASE | Database name | dip |

Observability (optional)

FlowX Observatory

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| USE_OBSERVATORY | Enable Observatory middleware (1 to enable) | 0 |
| FLOWX_OBSERVATORY_API_URL | Observatory API URL | http://observatory-api:80 |
| FLOWX_OBSERVATORY_APP_ID | Application ID for Observatory | - |
| FLOWX_OBSERVATORY_VERBOSE | Verbose Observatory logging | 0 |

Langfuse

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| LANGFUSE_PUBLIC_KEY | Langfuse public key | - |
| LANGFUSE_SECRET_KEY | Langfuse secret key | - |
| LANGFUSE_HOST | Langfuse server URL | - |

LLM Guard

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| FLOWX_LLMGUARD_URL | LLMGuard service URL | http://llmguard:8000 |

CORS configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| APPLICATION_CORS_ALLOW_ORIGIN | Allowed origins (comma-separated) | - |
| APPLICATION_CORS_ALLOW_HEADERS | Allowed headers | Accept,Content-Type,Authorization,... |
| APPLICATION_CORS_ALLOW_METHODS | Allowed methods | GET,PUT,POST,DELETE,PATCH,OPTIONS |
| APPLICATION_CORS_ALLOW_CREDENTIALS | Allow credentials | true |

MCP server configuration

| Environment Variable | Description | Default Value |
| --- | --- | --- |
| MCP_AUTH_KEY | Authentication key for MCP server endpoints | - |

Kafka job processing

The Web Crawler processes crawl requests asynchronously via Kafka. Integration Designer sends requests to the job request topic, and the Web Crawler writes results to S3 and sends a response to the job response topic.

Request payload

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | string[] | (required) | URLs to crawl |
| depth | int | 1 | Crawl depth (1 = single page) |
| maxPages | int | 50 | Max total pages to crawl |
| maxPagesPerLevel | int | 10 | Max pages per depth level |
| followLinks | bool | true | Follow links on crawled pages |
| processLinkedPdfs | bool | true | Process linked PDF documents |
| acceptDownloads | bool | false | Download files and upload to Document Plugin |
| pageTimeout | int | 30000 | Page navigation timeout (ms) |
| proxyUrl | string | null | Proxy URL for requests |
| applicationId | string | null | Application ID (required for file uploads) |
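A small helper that fills the documented defaults into a crawl request payload. The field names and defaults come from the table above; the helper itself is illustrative, not part of the service API.

```python
def build_crawl_request(urls, **overrides):
    """Build a crawl job payload with the documented defaults applied."""
    payload = {
        "urls": list(urls),          # required
        "depth": 1,
        "maxPages": 50,
        "maxPagesPerLevel": 10,
        "followLinks": True,
        "processLinkedPdfs": True,
        "acceptDownloads": False,
        "pageTimeout": 30000,
        "proxyUrl": None,
        "applicationId": None,       # required when acceptDownloads is true
    }
    payload.update(overrides)
    return payload
```

For example, `build_crawl_request(["https://example.com"], depth=2, acceptDownloads=True, applicationId="app-1")` produces a deep-crawl request with file downloads enabled.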

Response flow

  1. Integration Designer publishes a crawl request to ai.flowx.web-crawler.job.request
  2. Web Crawler processes the request and stores the result (Markdown content) in S3
  3. Web Crawler publishes a response with the result path to ai.flowx.web-crawler.job.response
  4. Integration Designer retrieves the result from S3
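The four steps above can be simulated end to end with in-memory stand-ins for the two topics and the S3 bucket. This is purely illustrative; message schemas and result paths are assumptions, not the real wire format.

```python
request_topic, response_topic, bucket = [], [], {}  # in-memory stand-ins

# 1. Integration Designer publishes a crawl request to the request topic
request_topic.append({"urls": ["https://example.com"], "depth": 1})

# 2. Web Crawler consumes the request and stores the Markdown result in S3
job = request_topic.pop(0)
result_path = "results/job-1.md"            # hypothetical object key
bucket[result_path] = "# Example Domain\n..."

# 3. Web Crawler publishes a response carrying the result path
response_topic.append({"resultPath": result_path})

# 4. Integration Designer reads the response and fetches the result from S3
response = response_topic.pop(0)
markdown = bucket[response["resultPath"]]
```

The key design point is that only a path travels back over Kafka; the potentially large Markdown result stays in object storage.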

Deployment and sizing

Docker

  • Base image: Python 3.13-slim-bookworm
  • Runs as non-root user (aiservices)
  • Port: 8080
  • Health check: /web-crawler/api/v1/web-crawler/info/health

Kubernetes configuration

| Setting | Value |
| --- | --- |
| Replicas | 2 |
| CPU requests | 1 core |
| CPU limits | 2 cores |
| RAM requests | 1 Gi |
| RAM limits | 2 Gi |
| GUNICORN_WORKERS | 4 |
The Web Crawler runs headless Chromium browsers for page rendering. Each concurrent crawl consumes significant memory. Monitor memory usage and increase limits if pods are OOM-killed during deep crawls.

Verify your setup

  • The Web Crawler pod is running: kubectl get pods -l app=web-crawler
  • The health endpoint returns HTTP 200: curl http://web-crawler:8080/web-crawler/api/v1/web-crawler/info/health
  • The Kafka consumer is connected — check pod logs for a "Kafka consumer started" message at startup
  • Integration Designer can reach the Web Crawler — verify FLOWX_WEBCRAWLER_BASEURL is set in the Integration Designer setup

Troubleshooting

Symptoms: Pod stays in CrashLoopBackOff or never becomes ready.
Solutions:
  • Verify Chromium/Playwright dependencies are available (included in Docker image)
  • Check that Kafka broker addresses are reachable
  • Ensure the Keycloak server URL is correct and accessible
  • Review pod logs for specific startup errors
Symptoms: Integration Designer sends crawl requests but no results are returned.
Solutions:
  • Verify KAFKA_CONSUMER_ENABLED is set to 1
  • Check that KAFKA_JOB_REQUEST_TOPIC matches the topic Integration Designer publishes to
  • Ensure the Kafka consumer group (KAFKA_CONSUMER_GROUPID) has no conflicting consumers
  • Review pod logs for Kafka connection or authentication errors
Symptoms: Pages return no content or partial Markdown.
Solutions:
  • Increase pageTimeout in the crawl request for slow-loading pages
  • Check if the target site blocks headless browsers (try configuring a proxyUrl)
  • Verify network connectivity from the pod to the target URLs
  • Review verbose logs (VERBOSE=1) for page-level errors
Symptoms: acceptDownloads is enabled but files are not appearing in the Document Plugin.
Solutions:
  • Verify DOCUMENT_PLUGIN_API_URL is set and reachable
  • Ensure applicationId is provided in the crawl request payload
  • Check service account credentials (SERVICE_OAUTH_CLIENT_ID / SERVICE_OAUTH_CLIENT_SECRET)
  • Review pod logs for upload errors or authentication failures
Symptoms: Pods restart with OOMKilled status during multi-page crawls.
Solutions:
  • Increase memory limits (try 4 Gi)
  • Reduce maxPages and maxPagesPerLevel in crawl requests
  • Reduce GUNICORN_WORKERS to limit concurrent browser instances
  • Scale horizontally with more pods instead of larger pods

Related resources

  • Integration Designer setup — configure Integration Designer, which orchestrates Web Crawler jobs
  • Integration Designer — learn about building integration workflows
  • Kafka Authentication — configure Kafka security and authentication
  • Document Plugin setup — configure the Document Plugin for file storage
Last modified on April 9, 2026