Available starting with FlowX.AI 5.5.0.

The Document Parser is a Python-based service that powers the Extract Data from File AI node in Agent Builder. It handles document and image processing for AI workflows, providing multiple parsing engines, OCR capabilities, table extraction, document classification, and signature detection.

Dependencies

The Document Parser connects to standard FlowX infrastructure services that should already be configured in your environment:
  • Kafka — async document processing and inter-service communication
  • Identity provider (Keycloak or Azure AD) — authentication
  • CMS Core — file retrieval from FlowX content management
  • AI provider (OpenAI, Azure OpenAI, Ollama, Anthropic, or GCP) — required for LLM and VLM parsing engines
  • Vector database (Qdrant or PostgreSQL with pgvector) — optional, for RAG use cases
The Document Parser retrieves files through CMS Core — it does not require its own S3/MinIO storage configuration.
The Document Parser is a Python 3.13 service that loads ML models at startup. Plan for longer startup times (5–20 minutes) compared to Java-based FlowX services.

Capabilities

What it can do

  • Multi-format parsing: PDF, DOCX, XLSX/XLS/XLSM, PPTX, JPG, PNG, TIFF
  • Three parsing engine families: Classic, LLM, and Docling (with Auto/Simple/OCR/VLM modes)
  • OCR text extraction: Tesseract, RapidOCR, EasyOCR
  • Table structure extraction: Detect and extract tabular data
  • Document classification: Keyword-based or LLM-based
  • Signature detection: Locate signatures within documents
  • Page rotation detection: Detect and correct rotated pages
  • Semantic chunking: Smart chunking for Retrieval-Augmented Generation (RAG) workflows
  • Vector store integration: Store embeddings in Qdrant or PostgreSQL/pgvector

What it cannot do

  • Cannot process encrypted or password-protected PDFs
  • Cannot process audio or video files
  • Cannot edit or modify source documents
  • Does not perform language translation

Supported formats

Format     | Extensions         | Notes
PDF        | .pdf               | All parsing engines supported
Word       | .docx              | Docling and Classic engines
Excel      | .xlsx, .xls, .xlsm | Table extraction supported
PowerPoint | .pptx              | Slide-by-slide extraction
Images     | .jpg, .png, .tiff  | Converted to PDF before parsing

Parsing engines

The Document Parser supports six parsing modes organized into three engine families:
Engine         | API value           | Best for                        | Speed  | Cost        | Accuracy
Classic        | ClassicParser       | Clean text PDFs                 | Fast   | Free        | Low–Medium
LLM            | LLMParser           | Any document via AI vision      | Slow   | High        | High
Docling Simple | DoclingParserSimple | Standard business documents     | Medium | Low         | Medium
Docling OCR    | DoclingParserOCR    | Scanned/image-heavy documents   | Slow   | Medium–High | High
Docling VLM    | DoclingParserVLM    | Complex layouts                 | Slow   | High        | Highest
Docling Auto   | DoclingParserAuto   | Unknown documents (auto-routes) | Varies | Varies      | Varies
Docling Auto analyzes each page using text density, gibberish ratio, image area coverage, and layout metrics. It then routes each page to the cheapest engine that can handle it: Standard (simple extraction), OCR (text-poor/garbled pages), or VLM (complex visual layouts).
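For intuition, here is a minimal sketch of what such per-page routing could look like. The metric names and thresholds are illustrative assumptions, not the service's actual values, and the layout metrics are omitted for brevity.

```python
# Illustrative sketch of per-page engine routing. Metric names and thresholds
# are assumptions for explanation, not the service's actual values.
from dataclasses import dataclass

@dataclass
class PageMetrics:
    text_density: float    # extracted characters relative to page area (0-1)
    gibberish_ratio: float # fraction of tokens that look like OCR noise
    image_coverage: float  # fraction of page area covered by images

def route_page(m: PageMetrics) -> str:
    """Pick the cheapest engine that can plausibly handle the page."""
    if m.image_coverage > 0.6:      # complex visual layout -> VLM
        return "VLM"
    if m.text_density < 0.05 or m.gibberish_ratio > 0.3:
        return "OCR"                # text-poor or garbled -> OCR
    return "Standard"               # clean text layer -> simple extraction

print(route_page(PageMetrics(text_density=0.4, gibberish_ratio=0.02, image_coverage=0.1)))
# -> Standard
```

Because routing happens per page, a mixed document only pays VLM prices for the pages that actually need it.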

Engine selection guide

Classic

Uses PyMuPDF for direct text extraction from the PDF file structure, with Tesseract OCR as a fallback for empty pages. A sketch of this approach follows the list below.
  • Zero cost (no external API calls)
  • Fastest processing
  • Only works with digital PDFs that have selectable text
  • Falls back to Tesseract OCR when pages have no text layer
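As a rough illustration of this approach (not the service's internal implementation), the sketch below extracts text with PyMuPDF and falls back to Tesseract, via pytesseract, on pages with an empty text layer:

```python
# Sketch of Classic-style extraction: PyMuPDF for the text layer, Tesseract
# (via pytesseract) as a fallback for pages with no selectable text.
# Requires: pip install pymupdf pytesseract pillow, plus a Tesseract binary.
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_pdf_text(path: str) -> list[str]:
    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text().strip()
            if not text:
                # No text layer: rasterize the page and OCR it instead.
                pix = page.get_pixmap(dpi=300)
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                text = pytesseract.image_to_string(img)
            pages.append(text)
    return pages
```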

API endpoints

All endpoints are prefixed with the URL_PREFIX value (default: /doc-parser).
Endpoint                                | Method | Purpose
/api/v1/doc-parser/extract/parse_file   | POST   | Parse an uploaded file (multipart form)
/api/v1/doc-parser/extract/parse_object | POST   | Parse a file from CMS or MinIO (JSON body)
/api/v1/doc-parser/extract/store_text   | POST   | Store extracted text as embeddings in a vector database
/api/v1/doc-parser/info/health          | GET    | Health check
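A hedged example of calling parse_file from Python with the requests library. The host, token handling, and the parser form-field name are assumptions for illustration; the engine values themselves come from the Parsing engines table above.

```python
# Hedged example: upload a PDF to the parse_file endpoint.
# Host, token, and the "parser" form-field name are assumptions.
import requests

BASE_URL = "https://flowx.example.com/doc-parser"  # hypothetical host + URL_PREFIX
TOKEN = "..."  # bearer token from your identity provider (Keycloak / Azure AD)

with open("invoice.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/api/v1/doc-parser/extract/parse_file",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={"parser": "DoclingParserAuto"},  # engine value from the table above
        timeout=600,  # generous timeout: OCR/VLM parsing can be slow
    )
resp.raise_for_status()
print(resp.json())
```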

Configuration

Server configuration

Environment variable      | Description                                 | Default value
URL_PREFIX                | FastAPI root path for reverse proxy routing | /doc-parser
GUNICORN_WORKERS          | Number of Gunicorn worker processes         | 1
GUNICORN_TIMEOUT          | Worker timeout in seconds                   | 600
GUNICORN_GRACEFUL_TIMEOUT | Graceful shutdown timeout in seconds        | 320
USE_DOCLING               | Pre-download Docling models at startup      | 0
VERBOSE                   | Enable verbose/debug logging                | 0
Keep GUNICORN_WORKERS at 1 per pod. The Document Parser loads large ML models into memory — multiple workers per pod will cause out-of-memory (OOM) errors. Scale horizontally by adding more pods instead.
Set USE_DOCLING to 1 if you use Docling parsing engines. This pre-downloads models at startup rather than on first request, avoiding timeout issues on the first document.

AI model configuration

The Document Parser uses three types of AI models, each independently configurable:
Model type | Variable             | Purpose                                                   | Default
Chat       | MODEL_TYPE           | LLM-based parsing, classification, structured extraction | OPENAI
Vision     | VISION_MODEL_TYPE    | VLM parsing, image analysis                               | OPENAI
Embedding  | EMBEDDING_MODEL_TYPE | Vector store embeddings, semantic chunking                | OPENAI
Environment variable     | Description                            | Default value
FALLBACK_MODEL_TYPE      | Fallback LLM provider if primary fails | -
STRUCTURED_OUTPUT_METHOD | Structured output method for the LLM   | json_schema
VLM_API_TIMEOUT          | Timeout in seconds for VLM API calls   | 300
All three model types support the same providers: OPENAI, AZUREOPENAI, OLLAMA, ANTHROPIC, GCP, GCP_VERTEX, CUSTOM_OPENAI, OPENAI_COMPATIBLE. Embedding models additionally support HUGGINGFACE and AWS.
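For illustration only, the snippet below shows one hypothetical mix of providers across the three model types; in a real deployment these variables would be set in the container spec rather than in application code.

```python
import os

# Hypothetical provider mix; in practice these are set on the container,
# not in application code. Provider values come from the list above.
os.environ.update({
    "MODEL_TYPE": "AZUREOPENAI",            # chat: LLM parsing, classification
    "VISION_MODEL_TYPE": "AZUREOPENAI",     # vision: VLM parsing, image analysis
    "EMBEDDING_MODEL_TYPE": "HUGGINGFACE",  # embeddings: vector store, chunking
    "FALLBACK_MODEL_TYPE": "OPENAI",        # optional fallback if the primary fails
})
```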

Provider-specific variables

Environment variable        | Description          | Default value
OPENAI_API_KEY              | OpenAI API key       | -
OPENAI_MODEL_NAME           | Chat model name      | gpt-4o-2024-08-06
OPENAI_VISION_MODEL_NAME    | Vision model name    | gpt-4o-2024-08-06
OPENAI_EMBEDDING_MODEL_NAME | Embedding model name | text-embedding-3-large
OPENAI_CHUNK_SIZE           | Embedding chunk size | 1000

OCR configuration

Environment variable | Description                          | Default value
OMP_NUM_THREADS      | Number of threads for OCR processing | 4
The OCR engine is selected per-request via the API ocr_engine parameter. Available engines: tesseract-cli (default), tesseract-ocr, rapidocr, easyocr.

Vector database configuration (optional)

Required only if using semantic chunking and RAG workflows. The vector store provider is selected per-request via the API vector_store_provider parameter.
Environment variable | Description       | Default value
QDRANT_HOST          | Qdrant server URL | http://localhost:6333
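A hedged sketch of calling store_text to push extracted text into Qdrant. Apart from vector_store_provider, which the docs name as a per-request parameter, the JSON field names here are assumptions; check the API reference for the exact request shape.

```python
# Hedged example: store extracted text as embeddings via store_text.
# Host and all fields except vector_store_provider are assumptions.
import requests

resp = requests.post(
    "https://flowx.example.com/doc-parser/api/v1/doc-parser/extract/store_text",
    headers={"Authorization": "Bearer ..."},
    json={
        "text": "Extracted document text ...",  # output of a previous parse call
        "vector_store_provider": "qdrant",      # or pgvector, per your deployment
        "collection": "contracts",              # hypothetical collection name
    },
    timeout=120,
)
resp.raise_for_status()
```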

Observability (optional)

Langfuse

Environment variable | Description         | Default value
LANGFUSE_PUBLIC_KEY  | Langfuse public key | -
LANGFUSE_SECRET_KEY  | Langfuse secret key | -
LANGFUSE_HOST        | Langfuse server URL | -

Deployment and sizing

Kubernetes configuration (15,000 pages/day)

For predictable, consistent workloads:
Setting          | Value
Replicas         | 3
CPU requests     | 3 cores
CPU limits       | 4 cores
RAM requests     | 4 Gi
RAM limits       | 6 Gi
GUNICORN_WORKERS | 1
OMP_NUM_THREADS  | 4

Important deployment notes

Pod startup time: 5–20 minutes. The service loads ML models (OCR, Docling) into memory at startup. Configure your readiness probes and deployment strategy accordingly.
  • Set GUNICORN_TIMEOUT to at least 600 (10 minutes) — increase for very large documents
  • Use 1 Gunicorn worker per pod — scale horizontally with more pods
  • Configure liveness probes with a generous initial delay (at least 300 seconds); see the health-check sketch after this list
  • Consider using a RollingUpdate strategy with maxUnavailable: 0 to avoid downtime during deploys
  • The service listens on port 8080
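Given the long model-loading window, a simple post-deploy check against the documented health endpoint can be useful in CI/CD. The sketch below assumes a hypothetical host; the endpoint path is the one listed in the API endpoints table.

```python
# Poll the health endpoint until the service finishes loading its models.
# The host is hypothetical; adjust the timeout to your startup window.
import time
import requests

HEALTH_URL = "https://flowx.example.com/doc-parser/api/v1/doc-parser/info/health"

def wait_until_healthy(timeout_s: int = 1800, interval_s: int = 30) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(HEALTH_URL, timeout=10).status_code == 200:
                print("Document Parser is ready")
                return
        except requests.RequestException:
            pass  # service not reachable yet; keep polling
        time.sleep(interval_s)
    raise TimeoutError("Document Parser did not become healthy in time")

wait_until_healthy()
```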

Cost considerations

Estimated monthly costs for a 15,000 pages/day workload using LLM-based parsing:
Component                   | Monthly cost estimate
OpenAI API calls            | ~$960
Kubernetes compute (3 pods) | ~$150–500
Total                       | ~$1,100–1,460
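As a sanity check on these figures, the arithmetic below works out the implied per-page API cost (estimates only):

```python
# Illustrative arithmetic behind the estimates above.
pages_per_month = 15_000 * 30              # 450,000 pages per month
openai_monthly = 960                       # ~$960/month in OpenAI API calls
print(f"~${openai_monthly / pages_per_month:.4f} per page")  # ~$0.0021 per page
```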

Cost reduction tips

Use keyword-only classification

Set use_llm_to_classify to false in API calls to rely on keyword matching instead of LLM-based classification. Saves ~$450/month.

Use Docling Auto

Let Docling Auto route pages to the cheapest engine that works. Saves ~$240/month compared to forcing OCR on all pages.

Turn off signature detection

Set detect_signature to false in API calls when not needed. Saves ~$90/month.

Use Classic for clean PDFs

Route clean digital PDFs to the Classic engine (free, no API calls). Reserve LLM/Docling for scanned or complex documents.
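Putting the tips together, a cost-conscious parse_object request might look like the sketch below. The documented flags (use_llm_to_classify, detect_signature) and the DoclingParserAuto engine value come from this page; the host and remaining field names are assumptions.

```python
# Hedged example of a cost-conscious parse_object request. Documented flags
# are from this page; the host and other field names are assumptions.
import requests

payload = {
    "file_id": "...",                   # CMS file reference (shape assumed)
    "parser": "DoclingParserAuto",      # auto-route pages to the cheapest engine
    "use_llm_to_classify": False,       # keyword-only classification (~$450/mo saved)
    "detect_signature": False,          # skip signature detection (~$90/mo saved)
}
resp = requests.post(
    "https://flowx.example.com/doc-parser/api/v1/doc-parser/extract/parse_object",
    headers={"Authorization": "Bearer ..."},
    json=payload,
    timeout=600,
)
resp.raise_for_status()
```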

Troubleshooting

Symptoms: Pods restart with OOMKilled status.
Solutions:
  • Increase memory limits (try 8 Gi)
  • Ensure GUNICORN_WORKERS is set to 1
  • Reduce OMP_NUM_THREADS to 2
  • Check for concurrent large document processing
Symptoms: Documents take much longer than expected to process.
Solutions:
  • Check if GPU is available for Docling VLM mode
  • Verify network latency to the AI provider
  • Use Docling Auto instead of forcing VLM on all documents
  • Check pod CPU utilization — may need more replicas
Symptoms: Extracted text is incomplete or garbled.
Solutions:
  • Try VLM or LLM mode for complex documents
  • Check source document quality (low resolution images reduce OCR accuracy)
  • Ensure the correct OCR engine is configured for your document language
  • Use apply_rotation: true for scanned documents that may be rotated
Symptoms: Pod stays in CrashLoopBackOff or never becomes ready.
Solutions:
  • Increase readiness probe initial delay to 600 seconds
  • Check that ML model files are accessible (network or storage issues)
  • Verify memory limits are sufficient for model loading (minimum 4 Gi)
  • Review pod logs for specific error messages
