Available starting with FlowX.AI 5.5.0.

The Document Parser is a Python-based service that powers the Extract Data from File AI node in Agent Builder. It handles document and image processing for AI workflows, providing multiple parsing engines, OCR capabilities, table extraction, document classification, and signature detection.

Dependencies

The Document Parser connects to standard FlowX infrastructure services that should already be configured in your environment:
  • Kafka — async document processing and inter-service communication
  • Identity provider (Keycloak or Azure AD) — authentication
  • CMS Core — file retrieval from FlowX content management
  • AI provider (OpenAI, Azure OpenAI, Ollama, Anthropic, or GCP) — required for LLM and VLM parsing engines
  • Vector database (Qdrant or PostgreSQL with pgvector) — optional, for RAG use cases
The Document Parser retrieves files through CMS Core — it does not require its own S3/MinIO storage configuration.
The Document Parser is a Python 3.13 service that loads ML models at startup. Plan for longer startup times (5–20 minutes) compared to Java-based FlowX services.

Capabilities

What it can do

  • Multi-format parsing: PDF, DOCX, XLSX/XLS/XLSM, PPTX, JPG, PNG, TIFF
  • Three parsing engine families: Classic, LLM, and Docling (with Auto/Simple/OCR/VLM modes)
  • OCR text extraction: Tesseract, RapidOCR, EasyOCR
  • Table structure extraction: Detect and extract tabular data
  • Document classification: Keyword-based or LLM-based
  • Signature detection: Locate signatures within documents
  • Page rotation detection: Detect and correct rotated pages
  • Semantic chunking: Smart chunking for Retrieval-Augmented Generation (RAG) workflows
  • Vector store integration: Store embeddings in Qdrant or PostgreSQL/pgvector

What it cannot do

  • Cannot process encrypted or password-protected PDFs
  • Cannot process audio or video files
  • Cannot edit or modify source documents
  • Does not perform language translation

Supported formats

Format     | Extensions         | Notes
PDF        | .pdf               | All parsing engines supported
Word       | .docx              | Docling and Classic engines
Excel      | .xlsx, .xls, .xlsm | Table extraction supported
PowerPoint | .pptx              | Slide-by-slide extraction
Images     | .jpg, .png, .tiff  | Converted to PDF before parsing

Parsing engines

The Document Parser supports six parsing modes organized into three engine families:
Engine         | API value           | Best for                        | Speed  | Cost        | Accuracy
Classic        | ClassicParser       | Clean text PDFs                 | Fast   | Free        | Low–Medium
LLM            | LLMParser           | Any document via AI vision      | Slow   | High        | High
Docling Simple | DoclingParserSimple | Standard business documents     | Medium | Low         | Medium
Docling OCR    | DoclingParserOCR    | Scanned/image-heavy documents   | Slow   | Medium–High | High
Docling VLM    | DoclingParserVLM    | Complex layouts                 | Slow   | High        | Highest
Docling Auto   | DoclingParserAuto   | Unknown documents (auto-routes) | Varies | Varies      | Varies
Docling Auto analyzes each page using text density, gibberish ratio, image area coverage, and layout metrics. It then routes each page to the cheapest engine that can handle it: Standard (simple extraction), OCR (text-poor/garbled pages), or VLM (complex visual layouts).
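For intuition, here is a minimal sketch of what such per-page routing could look like. The metric names and thresholds are illustrative assumptions, not the service's actual values, and the layout metrics are omitted for brevity.

```python
# Illustrative sketch of per-page engine routing. Metric names and thresholds
# are assumptions for explanation, not the service's actual values.
from dataclasses import dataclass

@dataclass
class PageMetrics:
    text_density: float    # extracted characters relative to page area (0-1)
    gibberish_ratio: float # fraction of tokens that look like OCR noise
    image_coverage: float  # fraction of page area covered by images

def route_page(m: PageMetrics) -> str:
    """Pick the cheapest engine that can plausibly handle the page."""
    if m.image_coverage > 0.6:      # complex visual layout -> VLM
        return "VLM"
    if m.text_density < 0.05 or m.gibberish_ratio > 0.3:
        return "OCR"                # text-poor or garbled -> OCR
    return "Standard"               # clean text layer -> simple extraction

print(route_page(PageMetrics(text_density=0.4, gibberish_ratio=0.02, image_coverage=0.1)))
# -> Standard
```

Because routing happens per page, a mixed document only pays VLM prices for the pages that actually need it.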

Engine selection guide

Classic

Uses PyMuPDF for direct text extraction from the PDF file structure, with Tesseract OCR as a fallback for empty pages. A sketch of this approach follows the list below.
  • Zero cost (no external API calls)
  • Fastest processing
  • Only works with digital PDFs that have selectable text
  • Falls back to Tesseract OCR when pages have no text layer
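As a rough illustration of this approach (not the service's internal implementation), the sketch below extracts text with PyMuPDF and falls back to Tesseract, via pytesseract, on pages with an empty text layer:

```python
# Sketch of Classic-style extraction: PyMuPDF for the text layer, Tesseract
# (via pytesseract) as a fallback for pages with no selectable text.
# Requires: pip install pymupdf pytesseract pillow, plus a Tesseract binary.
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_pdf_text(path: str) -> list[str]:
    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text().strip()
            if not text:
                # No text layer: rasterize the page and OCR it instead.
                pix = page.get_pixmap(dpi=300)
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                text = pytesseract.image_to_string(img)
            pages.append(text)
    return pages
```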

API endpoints

All endpoints are prefixed with the URL_PREFIX value (default: /doc-parser).
Endpoint                                | Method | Purpose
/api/v1/doc-parser/extract/parse_file   | POST   | Parse an uploaded file (multipart form)
/api/v1/doc-parser/extract/parse_object | POST   | Parse a file from CMS or MinIO (JSON body)
/api/v1/doc-parser/extract/store_text   | POST   | Store extracted text as embeddings in a vector database
/api/v1/doc-parser/info/health          | GET    | Health check
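A hedged example of calling parse_file from Python with the requests library. The host, token handling, and the parser form-field name are assumptions for illustration; the engine values themselves come from the Parsing engines table above.

```python
# Hedged example: upload a PDF to the parse_file endpoint.
# Host, token, and the "parser" form-field name are assumptions.
import requests

BASE_URL = "https://flowx.example.com/doc-parser"  # hypothetical host + URL_PREFIX
TOKEN = "..."  # bearer token from your identity provider (Keycloak / Azure AD)

with open("invoice.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/api/v1/doc-parser/extract/parse_file",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={"parser": "DoclingParserAuto"},  # engine value from the table above
        timeout=600,  # generous timeout: OCR/VLM parsing can be slow
    )
resp.raise_for_status()
print(resp.json())
```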

Configuration

Server configuration

Environment variable      | Description                                 | Default value
URL_PREFIX                | FastAPI root path for reverse proxy routing | /doc-parser
GUNICORN_WORKERS          | Number of Gunicorn worker processes         | 1
GUNICORN_TIMEOUT          | Worker timeout in seconds                   | 600
GUNICORN_GRACEFUL_TIMEOUT | Graceful shutdown timeout in seconds        | 320
USE_DOCLING               | Pre-download Docling models at startup      | 0
VERBOSE                   | Enable verbose/debug logging                | 0
Keep GUNICORN_WORKERS at 1 per pod. The Document Parser loads large ML models into memory — multiple workers per pod will cause out-of-memory (OOM) errors. Scale horizontally by adding more pods instead.
Set USE_DOCLING to 1 if you use Docling parsing engines. This pre-downloads models at startup rather than on first request, avoiding timeout issues on the first document.

AI model configuration

The Document Parser uses three types of AI models, each independently configurable:
Model type | Variable             | Purpose                                                   | Default
Chat       | MODEL_TYPE           | LLM-based parsing, classification, structured extraction | OPENAI
Vision     | VISION_MODEL_TYPE    | VLM parsing, image analysis                               | OPENAI
Embedding  | EMBEDDING_MODEL_TYPE | Vector store embeddings, semantic chunking                | OPENAI
Environment variable     | Description                            | Default value
FALLBACK_MODEL_TYPE      | Fallback LLM provider if primary fails | -
STRUCTURED_OUTPUT_METHOD | Structured output method for the LLM   | json_schema
VLM_API_TIMEOUT          | Timeout in seconds for VLM API calls   | 300
All three model types support the same providers: OPENAI, AZUREOPENAI, OLLAMA, ANTHROPIC, GCP, GCP_VERTEX, CUSTOM_OPENAI, OPENAI_COMPATIBLE. Embedding models additionally support HUGGINGFACE and AWS.
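For illustration only, the snippet below shows one hypothetical mix of providers across the three model types; in a real deployment these variables would be set in the container spec rather than in application code.

```python
import os

# Hypothetical provider mix; in practice these are set on the container,
# not in application code. Provider values come from the list above.
os.environ.update({
    "MODEL_TYPE": "AZUREOPENAI",            # chat: LLM parsing, classification
    "VISION_MODEL_TYPE": "AZUREOPENAI",     # vision: VLM parsing, image analysis
    "EMBEDDING_MODEL_TYPE": "HUGGINGFACE",  # embeddings: vector store, chunking
    "FALLBACK_MODEL_TYPE": "OPENAI",        # optional fallback if the primary fails
})
```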

Provider-specific variables

Environment variable        | Description          | Default value
OPENAI_API_KEY              | OpenAI API key       | -
OPENAI_MODEL_NAME           | Chat model name      | gpt-4o-2024-08-06
OPENAI_VISION_MODEL_NAME    | Vision model name    | gpt-4o-2024-08-06
OPENAI_EMBEDDING_MODEL_NAME | Embedding model name | text-embedding-3-large
OPENAI_CHUNK_SIZE           | Embedding chunk size | 1000

OCR configuration

Environment variable | Description                          | Default value
OMP_NUM_THREADS      | Number of threads for OCR processing | 4
The OCR engine is selected per-request via the API ocr_engine parameter. Available engines: tesseract-cli (default), tesseract-ocr, rapidocr, easyocr.

Vector database configuration (optional)

Required only if using semantic chunking and RAG workflows. The vector store provider is selected per-request via the API vector_store_provider parameter.
Environment variable | Description       | Default value
QDRANT_HOST          | Qdrant server URL | http://localhost:6333
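A hedged sketch of calling store_text to push extracted text into Qdrant. Apart from vector_store_provider, which the docs name as a per-request parameter, the JSON field names here are assumptions; check the API reference for the exact request shape.

```python
# Hedged example: store extracted text as embeddings via store_text.
# Host and all fields except vector_store_provider are assumptions.
import requests

resp = requests.post(
    "https://flowx.example.com/doc-parser/api/v1/doc-parser/extract/store_text",
    headers={"Authorization": "Bearer ..."},
    json={
        "text": "Extracted document text ...",  # output of a previous parse call
        "vector_store_provider": "qdrant",      # or pgvector, per your deployment
        "collection": "contracts",              # hypothetical collection name
    },
    timeout=120,
)
resp.raise_for_status()
```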

Observability (optional)

Langfuse

Environment variable | Description         | Default value
LANGFUSE_PUBLIC_KEY  | Langfuse public key | -
LANGFUSE_SECRET_KEY  | Langfuse secret key | -
LANGFUSE_HOST        | Langfuse server URL | -

Deployment and sizing

Kubernetes configuration (15,000 pages/day)

For predictable, consistent workloads:
Setting          | Value
Replicas         | 3
CPU requests     | 3 cores
CPU limits       | 4 cores
RAM requests     | 4 Gi
RAM limits       | 6 Gi
GUNICORN_WORKERS | 1
OMP_NUM_THREADS  | 4

Important deployment notes

Pod startup time: 5–20 minutes. The service loads ML models (OCR, Docling) into memory at startup. Configure your readiness probes and deployment strategy accordingly.
  • Set GUNICORN_TIMEOUT to at least 600 (10 minutes) — increase for very large documents
  • Use 1 Gunicorn worker per pod — scale horizontally with more pods
  • Configure liveness probes with a generous initial delay (at least 300 seconds); see the health-check sketch after this list
  • Consider using a RollingUpdate strategy with maxUnavailable: 0 to avoid downtime during deploys
  • The service listens on port 8080
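Given the long model-loading window, a simple post-deploy check against the documented health endpoint can be useful in CI/CD. The sketch below assumes a hypothetical host; the endpoint path is the one listed in the API endpoints table.

```python
# Poll the health endpoint until the service finishes loading its models.
# The host is hypothetical; adjust the timeout to your startup window.
import time
import requests

HEALTH_URL = "https://flowx.example.com/doc-parser/api/v1/doc-parser/info/health"

def wait_until_healthy(timeout_s: int = 1800, interval_s: int = 30) -> None:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(HEALTH_URL, timeout=10).status_code == 200:
                print("Document Parser is ready")
                return
        except requests.RequestException:
            pass  # service not reachable yet; keep polling
        time.sleep(interval_s)
    raise TimeoutError("Document Parser did not become healthy in time")

wait_until_healthy()
```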

Cost considerations

Estimated monthly costs for a 15,000 pages/day workload using LLM-based parsing:
Component                   | Monthly cost estimate
OpenAI API calls            | ~$960
Kubernetes compute (3 pods) | ~$150–500
Total                       | ~$1,100–1,460
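As a sanity check on these figures, the arithmetic below works out the implied per-page API cost (estimates only):

```python
# Illustrative arithmetic behind the estimates above.
pages_per_month = 15_000 * 30              # 450,000 pages per month
openai_monthly = 960                       # ~$960/month in OpenAI API calls
print(f"~${openai_monthly / pages_per_month:.4f} per page")  # ~$0.0021 per page
```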

Cost reduction tips

Use keyword-only classification

Set use_llm_to_classify to false in API calls to rely on keyword matching instead of LLM-based classification. Saves ~$450/month.

Use Docling Auto

Let Docling Auto route pages to the cheapest engine that works. Saves ~$240/month compared to forcing OCR on all pages.

Turn off signature detection

Set detect_signature to false in API calls when not needed. Saves ~$90/month.

Use Classic for clean PDFs

Route clean digital PDFs to the Classic engine (free, no API calls). Reserve LLM/Docling for scanned or complex documents.
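Putting the tips together, a cost-conscious parse_object request might look like the sketch below. The documented flags (use_llm_to_classify, detect_signature) and the DoclingParserAuto engine value come from this page; the host and remaining field names are assumptions.

```python
# Hedged example of a cost-conscious parse_object request. Documented flags
# are from this page; the host and other field names are assumptions.
import requests

payload = {
    "file_id": "...",                   # CMS file reference (shape assumed)
    "parser": "DoclingParserAuto",      # auto-route pages to the cheapest engine
    "use_llm_to_classify": False,       # keyword-only classification (~$450/mo saved)
    "detect_signature": False,          # skip signature detection (~$90/mo saved)
}
resp = requests.post(
    "https://flowx.example.com/doc-parser/api/v1/doc-parser/extract/parse_object",
    headers={"Authorization": "Bearer ..."},
    json=payload,
    timeout=600,
)
resp.raise_for_status()
```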

Troubleshooting

Symptoms: Pods restart with OOMKilled status.
Solutions:
  • Increase memory limits (try 8 Gi)
  • Ensure GUNICORN_WORKERS is set to 1
  • Reduce OMP_NUM_THREADS to 2
  • Check for concurrent large document processing
Symptoms: Documents take much longer than expected to process.
Solutions:
  • Check if GPU is available for Docling VLM mode
  • Verify network latency to the AI provider
  • Use Docling Auto instead of forcing VLM on all documents
  • Check pod CPU utilization — may need more replicas
Symptoms: Extracted text is incomplete or garbled.
Solutions:
  • Try VLM or LLM mode for complex documents
  • Check source document quality (low resolution images reduce OCR accuracy)
  • Ensure the correct OCR engine is configured for your document language
  • Use apply_rotation: true for scanned documents that may be rotated
Symptoms: Pod stays in CrashLoopBackOff or never becomes ready.
Solutions:
  • Increase readiness probe initial delay to 600 seconds
  • Check that ML model files are accessible (network or storage issues)
  • Verify memory limits are sufficient for model loading (minimum 4 Gi)
  • Review pod logs for specific error messages
