Available starting with FlowX.AI 5.6.0
The Web Crawler powers the web page extraction capabilities in Integration Designer workflows.
Dependencies
The Web Crawler connects to standard FlowX infrastructure services that should already be configured in your environment:
- Kafka — async job processing for crawl requests from Integration Designer
- Identity provider (Keycloak or Azure AD) — authentication and Kafka OAuth
- CMS Core — content management integration
- Document Plugin — file upload for downloaded documents
- S3-compatible storage (MinIO or AWS S3) — storing crawl results and offloaded Kafka payloads
- AI provider (OpenAI, Anthropic, Ollama, Azure OpenAI, or GCP) — required only for browser automation features
Capabilities
What it can do
- Single-page crawling: Extract content from individual web pages as Markdown
- Deep crawling: Breadth-first crawling with configurable depth, concurrency, and rate limiting
- PDF processing: Dedicated scraping strategy for linked PDF documents
- File downloads: Automatic detection and upload of downloadable files (.docx, .xlsx, .pdf) to the Document Plugin
- AI browser automation: Multi-step browser interactions using Claude or GPT-4o via the browser-use library
- Session management: Persistent browser sessions for multi-step automation workflows
- Per-level URL caps: Configurable maximum pages per depth level during deep crawls
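The interplay of deep crawling with the global and per-level caps can be sketched as a breadth-first traversal. The function below is a minimal illustration, assuming a fetch_links(url) callable that returns the links found on a page; the real service also applies concurrency and rate limiting, which are omitted here.

```python
def deep_crawl(start_urls, fetch_links, depth=1, max_pages=50, max_pages_per_level=10):
    """Breadth-first crawl honoring total and per-level page caps (illustrative sketch)."""
    visited = []
    frontier = list(dict.fromkeys(start_urls))  # de-duplicate while keeping order
    seen = set(frontier)
    for _ in range(depth):
        next_frontier = []
        for url in frontier[:max_pages_per_level]:  # per-level URL cap
            if len(visited) >= max_pages:           # global page cap
                return visited
            visited.append(url)                     # stand-in for "extract Markdown"
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return visited
```

With depth=1 only the start URLs are visited, matching the single-page crawling behavior described above.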
What it cannot do
- Cannot crawl pages that require JavaScript-based authentication (unless using browser automation)
- Cannot process audio or video content
- Cannot bypass CAPTCHAs or bot protection without proxy configuration
API endpoints
All endpoints are prefixed with the URL_PREFIX value (default: /web-crawler).
| Endpoint | Method | Purpose |
|---|---|---|
| /api/v1/web-crawler/parse | POST | Crawl URLs and extract Markdown content |
| /api/v1/web-crawler/interact | POST | Start AI browser automation session |
| /api/v1/web-crawler/interact/{session_id} | POST | Continue automation in existing session |
| /api/v1/web-crawler/interact/{session_id} | DELETE | Close browser session |
| /api/v1/web-crawler/info/health | GET | Health check |
Configuration
Server configuration
| Environment Variable | Description | Default Value |
|---|---|---|
| URL_PREFIX | FastAPI root path for reverse proxy routing | /web-crawler |
| VERSION | Service version identifier | DEV |
| VERBOSE | Enable verbose/debug logging (1 to enable) | 0 |
| GUNICORN_WORKERS | Number of Gunicorn worker processes | 4 |
| GUNICORN_TIMEOUT | Worker timeout in seconds | 600 |
Authentication configuration
Keycloak / OAuth2
| Environment Variable | Description | Default Value |
|---|---|---|
| SECURITY_OAUTH2_BASE_SERVER_URL | Keycloak server URL | (required) |
| SECURITY_OAUTH2_REALM | Default Keycloak realm | flowx |
| SECURITY_OAUTH2_CLIENT_ID | Platform auth client ID | flowx-platform-authenticate |
| SECURITY_OAUTH2_CLIENT_SECRET | Platform auth client secret | flowx-platform-authenticate-secret |
| SECURITY_PROVIDER | Auth provider (KEYCLOAK or AZURE) | KEYCLOAK |
Service account
The Web Crawler uses a service account for platform API calls (CMS, Document Plugin).
| Environment Variable | Description | Default Value |
|---|---|---|
| SERVICE_OAUTH_CLIENT_ID | Service account client ID | flowx-scheduler-core-sa |
| SERVICE_OAUTH_CLIENT_SECRET | Service account client secret | - |
| SERVICE_OAUTH_CLIENT_REALM | Service account Keycloak realm | - |
Kafka configuration
The Web Crawler uses Kafka for async job processing. Integration Designer sends crawl requests, and the Web Crawler returns results through Kafka topics.
Core Kafka settings
| Environment Variable | Description | Default Value |
|---|---|---|
| KAFKA_BOOTSTRAP_SERVERS | Kafka broker address | kafka:9092 |
| KAFKA_CONSUMER_ENABLED | Enable Kafka consumer (0 to turn off) | 1 |
| KAFKA_PRODUCER_ENABLED | Enable Kafka producer (0 to turn off) | 1 |
| KAFKA_CONSUMER_GROUPID | Consumer group ID | ai-services |
Consumer tuning
| Environment Variable | Description | Default Value |
|---|---|---|
| KAFKA_MAX_POLL_INTERVAL_MS | Max poll interval (ms) | 300000 |
| KAFKA_SESSION_TIMEOUT_MS | Session timeout (ms) | 300000 |
| KAFKA_HEARTBEAT_INTERVAL_MS | Heartbeat interval (ms) | 10000 |
OAuth authentication
When using KAFKA_SECURITY_ENABLED=1, the Web Crawler authenticates to Kafka using SASL/OAUTHBEARER via Keycloak.
| Environment Variable | Description | Default Value |
|---|---|---|
| KAFKA_SECURITY_ENABLED | Enable SASL/OAUTHBEARER auth (0 to turn off) | 1 |
| KAFKA_OAUTH_CLIENT_ID | Service account for Kafka SASL | flowx-service-client |
| KAFKA_OAUTH_CLIENT_SECRET | Service account secret | - |
| KAFKA_OAUTH_TOKEN_ENDPOINT_URI | Keycloak token endpoint for Kafka OAuth | {SECURITY_OAUTH2_BASE_SERVER_URL}/realms/kafka-authz/protocol/openid-connect/token |
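The default token endpoint is templated from SECURITY_OAUTH2_BASE_SERVER_URL. A small sketch of that substitution; the helper name and the trailing-slash normalization are illustrative, not part of the service:

```python
def default_token_endpoint(base_server_url: str) -> str:
    """Build the default Kafka OAuth token endpoint from the Keycloak base URL."""
    base = base_server_url.rstrip("/")  # avoid a double slash when joining
    return f"{base}/realms/kafka-authz/protocol/openid-connect/token"
```

Set KAFKA_OAUTH_TOKEN_ENDPOINT_URI explicitly only if your Kafka authorization realm differs from kafka-authz.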
Topic configuration
| Environment Variable | Description | Default Value |
|---|---|---|
| KAFKA_JOB_REQUEST_TOPIC | Incoming crawl job requests (from Integration Designer) | ai.flowx.web-crawler.job.request |
| KAFKA_JOB_RESPONSE_TOPIC | Outgoing crawl job responses (to Integration Designer) | ai.flowx.web-crawler.job.response |
The Web Crawler integrates with flx-job-lib for async job processing. On startup, the service verifies Kafka connectivity and starts a background consumer.
Kafka payload offloading
When crawl results exceed the size limit, payloads are offloaded to S3-compatible storage.
| Environment Variable | Description | Default Value |
|---|---|---|
| KAFKA_OFFLOAD_BACKEND | Storage backend (minio or azure) | minio |
| KAFKA_OFFLOAD_SIZE_LIMIT | Payload size threshold (bytes) before offloading | 512000 |
| KAFKA_OFFLOAD_ROOT_FOLDER | Root folder for offloaded payloads | payload-offload |
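The offloading rule itself is simple: payloads at or below KAFKA_OFFLOAD_SIZE_LIMIT travel inline on the topic, while larger ones are written to storage and replaced by a reference. A minimal sketch, using an in-memory dict as a stand-in for the S3/MinIO client; prepare_message, the object-key scheme, and the message shape are illustrative assumptions, not the service's wire format:

```python
import json
import uuid

SIZE_LIMIT = 512_000            # default KAFKA_OFFLOAD_SIZE_LIMIT (bytes)
ROOT_FOLDER = "payload-offload"  # default KAFKA_OFFLOAD_ROOT_FOLDER

def prepare_message(payload: dict, storage: dict, size_limit: int = SIZE_LIMIT) -> dict:
    """Offload large payloads to storage; send small ones inline (illustrative)."""
    raw = json.dumps(payload).encode("utf-8")
    if len(raw) <= size_limit:
        return {"inline": payload}
    key = f"{ROOT_FOLDER}/{uuid.uuid4()}.json"  # object key under the offload root
    storage[key] = raw                          # stand-in for an S3/MinIO put
    return {"offloadedTo": key}
```

The consumer side does the inverse: if the message carries a storage reference rather than an inline body, it fetches the payload from the bucket before processing.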
Storage configuration
The Web Crawler stores crawl results in S3-compatible storage. Configure one of the following storage backends.
- MinIO / S3
- AWS S3
- Azure Blob Storage
| Environment Variable | Description | Default Value |
|---|---|---|
| MINIO_DOCUMENTS_URL | MinIO/S3 endpoint URL | - |
| MINIO_DOCUMENTS_ACCESS_KEY | MinIO access key | - |
| MINIO_DOCUMENTS_SECRET_KEY | MinIO secret key | - |
| MINIO_DOCUMENTS_BUCKET | MinIO bucket name | - |
| MINIO_DOCUMENTS_SECURE | Use HTTPS for MinIO (1 to enable) | 0 |
Platform service URLs
| Environment Variable | Description | Default Value |
|---|---|---|
| CMS_API_URL | CMS Core service URL | http://cms:80 |
| DOCUMENT_API_URL | Document Plugin URL | http://document-plugin:80 |
| DOCUMENT_PLUGIN_API_URL | Document Plugin URL (for file uploads) | http://document-plugin:80 |
| APP_MANAGER_API_URL | App Manager service URL | http://application-manager:80 |
| CMS_STATIC_ASSETS_PATH | Static assets CDN path for CMS downloads | - |
AI model configuration
Required only for the browser automation feature (/interact endpoints). The Web Crawler uses LLM models to drive browser interactions.
| Model type | Variable | Purpose | Default |
|---|---|---|---|
| Chat | MODEL_TYPE | Primary LLM for browser automation | OPENAI |
| Vision | VISION_MODEL_TYPE | Visual page understanding | OPENAI |
| Embedding | EMBEDDING_MODEL_TYPE | Text embeddings | OPENAI |
| Environment Variable | Description | Default Value |
|---|---|---|
| FALLBACK_MODEL_TYPE | Fallback LLM provider if primary fails | - |
| STRUCTURED_OUTPUT_METHOD | Structured output method for LLM | json_schema |
All model types support the same providers: OPENAI, AZURE, OLLAMA, ANTHROPIC, GCP, CUSTOM_OPENAI.
Provider-specific variables
- OpenAI
- Anthropic
- Ollama
- Other providers
| Environment Variable | Description | Default Value |
|---|---|---|
| OPENAI_API_KEY | OpenAI API key | - |
| OPENAI_MODEL_NAME | Chat model name | gpt-4o-2024-08-06 |
| OPENAI_VISION_MODEL_NAME | Vision model name | gpt-4o-2024-08-06 |
| OPENAI_EMBEDDING_MODEL_NAME | Embedding model name | text-embedding-3-large |
| OPENAI_CHUNK_SIZE | Embedding chunk size | 1000 |
PostgreSQL configuration
The Web Crawler uses PostgreSQL for embedding storage, memory, and chat history.
- Embedding database
- Memory store
- Chat history
| Environment Variable | Description | Default Value |
|---|---|---|
| POSTGRES_EMBEDDING_URI | Host and port | postgresql-ai:5432 |
| POSTGRES_EMBEDDING_USERNAME | Username | dip |
| POSTGRES_EMBEDDING_PASSWORD | Password | - |
| POSTGRES_EMBEDDING_DATABASE | Database name | dip |
Observability (optional)
FlowX Observatory
| Environment Variable | Description | Default Value |
|---|---|---|
| USE_OBSERVATORY | Enable Observatory middleware (1 to enable) | 0 |
| FLOWX_OBSERVATORY_API_URL | Observatory API URL | http://observatory-api:80 |
| FLOWX_OBSERVATORY_APP_ID | Application ID for Observatory | - |
| FLOWX_OBSERVATORY_VERBOSE | Verbose Observatory logging | 0 |
Langfuse
| Environment Variable | Description | Default Value |
|---|---|---|
| LANGFUSE_PUBLIC_KEY | Langfuse public key | - |
| LANGFUSE_SECRET_KEY | Langfuse secret key | - |
| LANGFUSE_HOST | Langfuse server URL | - |
LLM Guard
| Environment Variable | Description | Default Value |
|---|---|---|
| FLOWX_LLMGUARD_URL | LLMGuard service URL | http://llmguard:8000 |
CORS configuration
| Environment Variable | Description | Default Value |
|---|---|---|
| APPLICATION_CORS_ALLOW_ORIGIN | Allowed origins (comma-separated) | - |
| APPLICATION_CORS_ALLOW_HEADERS | Allowed headers | Accept,Content-Type,Authorization,... |
| APPLICATION_CORS_ALLOW_METHODS | Allowed methods | GET,PUT,POST,DELETE,PATCH,OPTIONS |
| APPLICATION_CORS_ALLOW_CREDENTIALS | Allow credentials | true |
MCP server configuration
| Environment Variable | Description | Default Value |
|---|---|---|
| MCP_AUTH_KEY | Authentication key for MCP server endpoints | - |
Kafka job processing
The Web Crawler processes crawl requests asynchronously via Kafka. Integration Designer sends requests to the job request topic, and the Web Crawler writes results to S3 and sends a response to the job response topic.
Request payload
| Field | Type | Default | Description |
|---|---|---|---|
| urls | string[] | (required) | URLs to crawl |
| depth | int | 1 | Crawl depth (1 = single page) |
| maxPages | int | 50 | Max total pages to crawl |
| maxPagesPerLevel | int | 10 | Max pages per depth level |
| followLinks | bool | true | Follow links on crawled pages |
| processLinkedPdfs | bool | true | Process linked PDF documents |
| acceptDownloads | bool | false | Download files and upload to Document Plugin |
| pageTimeout | int | 30000 | Page navigation timeout (ms) |
| proxyUrl | string | null | Proxy URL for requests |
| applicationId | string | null | Application ID (required for file uploads) |
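As a quick sanity check on the defaults above, a request body can be built by overlaying caller-supplied fields on the documented defaults. The helper below is illustrative (the service validates requests server-side); the one cross-field rule it enforces comes straight from the table: applicationId is required when acceptDownloads is enabled.

```python
CRAWL_DEFAULTS = {
    "depth": 1, "maxPages": 50, "maxPagesPerLevel": 10,
    "followLinks": True, "processLinkedPdfs": True, "acceptDownloads": False,
    "pageTimeout": 30000, "proxyUrl": None, "applicationId": None,
}

def build_crawl_request(urls, **overrides):
    """Merge documented defaults with overrides; urls is the only required field."""
    if not urls:
        raise ValueError("urls is required")
    unknown = set(overrides) - set(CRAWL_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    request = {"urls": list(urls), **CRAWL_DEFAULTS, **overrides}
    # file uploads go to the Document Plugin, which needs an application ID
    if request["acceptDownloads"] and not request["applicationId"]:
        raise ValueError("applicationId is required when acceptDownloads is true")
    return request
```

For example, build_crawl_request(["https://example.com"], depth=2) yields a payload with depth 2 and all other fields at their documented defaults.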
Response flow
- Integration Designer publishes a crawl request to ai.flowx.web-crawler.job.request
- Web Crawler processes the request and stores the result (Markdown content) in S3
- Web Crawler publishes a response with the result path to ai.flowx.web-crawler.job.response
- Integration Designer retrieves the result from S3
Deployment and sizing
Docker
- Base image: Python 3.13-slim-bookworm
- Runs as non-root user (aiservices)
- Port: 8080
- Health check: /web-crawler/api/v1/web-crawler/info/health
Kubernetes configuration
| Setting | Value |
|---|---|
| Replicas | 2 |
| CPU requests | 1 core |
| CPU limits | 2 cores |
| RAM requests | 1 Gi |
| RAM limits | 2 Gi |
| GUNICORN_WORKERS | 4 |
Verify your setup
- The Web Crawler pod is running: kubectl get pods -l app=web-crawler
- The health endpoint returns HTTP 200: curl http://web-crawler:8080/web-crawler/api/v1/web-crawler/info/health
- The Kafka consumer is connected — check pod logs for the Kafka consumer started message at startup
- Integration Designer can reach the Web Crawler — verify FLOWX_WEBCRAWLER_BASEURL is set in Integration Designer setup
Troubleshooting
Pod fails to start or crashes
Symptoms: Pod stays in CrashLoopBackOff or never becomes ready.
Solutions:
- Verify Chromium/Playwright dependencies are available (included in Docker image)
- Check that Kafka broker addresses are reachable
- Ensure the Keycloak server URL is correct and accessible
- Review pod logs for specific startup errors
Crawl jobs not being processed
Symptoms: Integration Designer sends crawl requests but no results are returned.
Solutions:
- Verify KAFKA_CONSUMER_ENABLED is set to 1
- Check that KAFKA_JOB_REQUEST_TOPIC matches the topic Integration Designer publishes to
- Ensure the Kafka consumer group (KAFKA_CONSUMER_GROUPID) has no conflicting consumers
- Review pod logs for Kafka connection or authentication errors
Crawl results are empty or incomplete
Symptoms: Pages return no content or partial Markdown.
Solutions:
- Increase pageTimeout in the crawl request for slow-loading pages
- Check if the target site blocks headless browsers (try configuring a proxyUrl)
- Verify network connectivity from the pod to the target URLs
- Review verbose logs (VERBOSE=1) for page-level errors
File downloads not uploading
Symptoms: acceptDownloads is enabled but files are not appearing in the Document Plugin.
Solutions:
- Verify DOCUMENT_PLUGIN_API_URL is set and reachable
- Ensure applicationId is provided in the crawl request payload
- Check service account credentials (SERVICE_OAUTH_CLIENT_ID / SERVICE_OAUTH_CLIENT_SECRET)
- Review pod logs for upload errors or authentication failures
OOM kills during deep crawls
Symptoms: Pods restart with OOMKilled status during multi-page crawls.
Solutions:
- Increase memory limits (try 4 Gi)
- Reduce maxPages and maxPagesPerLevel in crawl requests
- Reduce GUNICORN_WORKERS to limit concurrent browser instances
- Scale horizontally with more pods instead of larger pods
Related resources
Integration Designer setup
Configure Integration Designer, which orchestrates Web Crawler jobs
Integration Designer
Learn about building integration workflows
Kafka Authentication
Configure Kafka security and authentication
Document Plugin setup
Configure the Document Plugin for file storage

