Available starting with FlowX.AI 5.5.0.

The Document Parser is a Python-based service that powers the Extract Data from File AI node in Agent Builder.
Dependencies
The Document Parser connects to standard FlowX infrastructure services that should already be configured in your environment:
- Kafka — async document processing and inter-service communication
- Identity provider (Keycloak or Azure AD) — authentication
- CMS Core — file retrieval from FlowX content management
- AI provider (OpenAI, Azure OpenAI, Ollama, Anthropic, or GCP) — required for LLM and VLM parsing engines
- Vector database (Qdrant or PostgreSQL with pgvector) — optional, for RAG use cases
The Document Parser retrieves files through CMS Core — it does not require its own S3/MinIO storage configuration.
The Document Parser is a Python 3.13 service that loads ML models at startup. Plan for longer startup times (5–20 minutes) compared to Java-based FlowX services.
Capabilities
What it can do
- Multi-format parsing: PDF, DOCX, XLSX/XLS/XLSM, PPTX, JPG, PNG, TIFF
- Three parsing engine families: Classic, LLM, and Docling (with Auto/Simple/OCR/VLM modes)
- OCR text extraction: Tesseract, RapidOCR, EasyOCR
- Table structure extraction: Detect and extract tabular data
- Document classification: Keyword-based or LLM-based
- Signature detection: Locate signatures within documents
- Page rotation detection: Detect and correct rotated pages
- Semantic chunking: Smart chunking for Retrieval-Augmented Generation (RAG) workflows
- Vector store integration: Store embeddings in Qdrant or PostgreSQL/pgvector
What it cannot do
- Cannot process encrypted or password-protected PDFs
- Cannot process audio or video files
- Cannot edit or modify source documents
- Does not perform language translation
Supported formats
| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | All parsing engines supported |
| Word | .docx | Docling and Classic engines |
| Excel | .xlsx, .xls, .xlsm | Table extraction supported |
| PowerPoint | .pptx | Slide-by-slide extraction |
| Images | .jpg, .png, .tiff | Converted to PDF before parsing |
Parsing engines
The Document Parser supports six parsing modes organized into three engine families:

| Engine | API Value | Best for | Speed | Cost | Accuracy |
|---|---|---|---|---|---|
| Classic | ClassicParser | Clean text PDFs | Fast | Free | Low–Medium |
| LLM | LLMParser | Any document via AI vision | Slow | High | High |
| Docling Simple | DoclingParserSimple | Standard business documents | Medium | Low | Medium |
| Docling OCR | DoclingParserOCR | Scanned/image-heavy documents | Slow | Medium–High | High |
| Docling VLM | DoclingParserVLM | Complex layouts | Slow | High | Highest |
| Docling Auto | DoclingParserAuto | Unknown documents (auto-routes) | Varies | Varies | Varies |
Docling Auto analyzes each page using text density, gibberish ratio, image area coverage, and layout metrics. It then routes each page to the cheapest engine that can handle it: Standard (simple extraction), OCR (text-poor/garbled pages), or VLM (complex visual layouts).
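The routing decision described above can be sketched as a small function. This is an illustrative sketch only: the signal names come from this section, but the threshold values are hypothetical, not the service's actual tuning.

```python
def route_page(text_density: float, gibberish_ratio: float,
               image_area: float) -> str:
    """Route one page to the cheapest Docling engine that can handle it.

    text_density    -- extractable text per unit of page area (0..1)
    gibberish_ratio -- fraction of extracted tokens that look garbled (0..1)
    image_area      -- fraction of the page covered by images (0..1)

    Thresholds are illustrative placeholders.
    """
    if image_area > 0.6:
        return "VLM"       # complex visual layout -> vision language model
    if text_density < 0.05 or gibberish_ratio > 0.3:
        return "OCR"       # text-poor or garbled page -> OCR pass
    return "Standard"      # clean text layer -> simple extraction


# Each page is routed independently, so a mixed document
# (clean pages plus scans plus diagrams) can use all three engines.
```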
Engine selection guide
Classic

Uses PyMuPDF for direct text extraction from the PDF file structure, with Tesseract OCR as a fallback for empty pages.
- Zero cost (no external API calls)
- Fastest processing
- Only works with digital PDFs that have selectable text
- Falls back to Tesseract OCR when pages have no text layer
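The Classic engine's fallback rule can be sketched in a few lines. The helper names below are hypothetical; in the real service the per-page text would come from PyMuPDF's text extraction, and the fallback branch would invoke Tesseract.

```python
def needs_ocr_fallback(page_text: str, min_chars: int = 1) -> bool:
    """A page with no selectable text layer gets routed to Tesseract OCR."""
    return len(page_text.strip()) < min_chars


def parse_pages(page_texts: list[str]) -> list[str]:
    """page_texts simulates the per-page output of PyMuPDF text extraction."""
    results = []
    for text in page_texts:
        if needs_ocr_fallback(text):
            results.append("OCR_FALLBACK")  # Tesseract would run here
        else:
            results.append(text)            # text layer used as-is, zero cost
    return results
```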
API endpoints
All endpoints are prefixed with the URL_PREFIX value (default: /doc-parser).
| Endpoint | Method | Purpose |
|---|---|---|
| /api/v1/doc-parser/extract/parse_file | POST | Parse an uploaded file (multipart form) |
| /api/v1/doc-parser/extract/parse_object | POST | Parse a file from CMS or MinIO (JSON body) |
| /api/v1/doc-parser/extract/store_text | POST | Store extracted text as embeddings in a vector database |
| /api/v1/doc-parser/info/health | GET | Health check |
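As a sketch of how the prefix and endpoint paths compose, the snippet below builds the health-check URL. The host is a placeholder for your deployment, and whether the external URL carries the prefix depends on how your reverse proxy routes the FastAPI root path, so adjust accordingly.

```python
URL_PREFIX = "/doc-parser"  # default, configurable via the URL_PREFIX env var

# Hypothetical in-cluster service address; the service listens on port 8080.
host = "http://doc-parser.flowx.svc:8080"

health_url = host + URL_PREFIX + "/api/v1/doc-parser/info/health"
# A Kubernetes liveness/readiness probe can GET this URL once the
# ML models have finished loading (allow a generous initial delay).
```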
Configuration
Server configuration
| Environment Variable | Description | Default Value |
|---|---|---|
| URL_PREFIX | FastAPI root path for reverse proxy routing | /doc-parser |
| GUNICORN_WORKERS | Number of Gunicorn worker processes | 1 |
| GUNICORN_TIMEOUT | Worker timeout in seconds | 600 |
| GUNICORN_GRACEFUL_TIMEOUT | Graceful shutdown timeout in seconds | 320 |
| USE_DOCLING | Pre-download Docling models at startup | 0 |
| VERBOSE | Enable verbose/debug logging | 0 |
AI model configuration
The Document Parser uses three types of AI models, each independently configurable:

| Model type | Variable | Purpose | Default |
|---|---|---|---|
| Chat | MODEL_TYPE | LLM-based parsing, classification, structured extraction | OPENAI |
| Vision | VISION_MODEL_TYPE | VLM parsing, image analysis | OPENAI |
| Embedding | EMBEDDING_MODEL_TYPE | Vector store embeddings, semantic chunking | OPENAI |
| Environment Variable | Description | Default Value |
|---|---|---|
| FALLBACK_MODEL_TYPE | Fallback LLM provider if primary fails | - |
| STRUCTURED_OUTPUT_METHOD | Structured output method for LLM | json_schema |
| VLM_API_TIMEOUT | Timeout in seconds for VLM API calls | 300 |
All three model types support the same providers: OPENAI, AZUREOPENAI, OLLAMA, ANTHROPIC, GCP, GCP_VERTEX, CUSTOM_OPENAI, OPENAI_COMPATIBLE. Embedding models additionally support HUGGINGFACE and AWS.

Provider-specific variables
OpenAI
| Environment Variable | Description | Default Value |
|---|---|---|
| OPENAI_API_KEY | OpenAI API key | - |
| OPENAI_MODEL_NAME | Chat model name | gpt-4o-2024-08-06 |
| OPENAI_VISION_MODEL_NAME | Vision model name | gpt-4o-2024-08-06 |
| OPENAI_EMBEDDING_MODEL_NAME | Embedding model name | text-embedding-3-large |
| OPENAI_CHUNK_SIZE | Embedding chunk size | 1000 |
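A minimal environment for running all three model types against OpenAI might look like the fragment below. Values are placeholders drawn from the defaults above; the API key in particular must come from your own secret management, never from a committed file.

```shell
# Select OpenAI for chat, vision, and embedding models.
export MODEL_TYPE=OPENAI
export VISION_MODEL_TYPE=OPENAI
export EMBEDDING_MODEL_TYPE=OPENAI

# Placeholder credentials and model names (defaults shown in the table above).
export OPENAI_API_KEY=sk-placeholder            # never commit real keys
export OPENAI_MODEL_NAME=gpt-4o-2024-08-06
export OPENAI_EMBEDDING_MODEL_NAME=text-embedding-3-large
```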
OCR configuration
| Environment Variable | Description | Default Value |
|---|---|---|
| OMP_NUM_THREADS | Number of threads for OCR processing | 4 |
The OCR engine is selected per-request via the API ocr_engine parameter. Available engines: tesseract-cli (default), tesseract-ocr, rapidocr, easyocr.

Vector database configuration (optional)
Required only if using semantic chunking and RAG workflows. The vector store provider is selected per-request via the API vector_store_provider parameter.
Qdrant
| Environment Variable | Description | Default Value |
|---|---|---|
| QDRANT_HOST | Qdrant server URL | http://localhost:6333 |
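To show how the per-request provider selection fits together with the Qdrant configuration, here is a hypothetical request body for the store_text endpoint. The field names are illustrative only; check the service's OpenAPI schema for the exact contract.

```python
# Hypothetical store_text request body; field names are illustrative.
store_text_body = {
    "text": "Extracted document text ...",   # output of a prior parse call
    "vector_store_provider": "qdrant",       # per-request selection
    "collection": "contracts",               # placeholder collection name
}

# With QDRANT_HOST pointing at your Qdrant server, embeddings produced by
# the configured EMBEDDING_MODEL_TYPE are written to this collection.
```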
Observability (optional)
Langfuse
| Environment Variable | Description | Default Value |
|---|---|---|
| LANGFUSE_PUBLIC_KEY | Langfuse public key | - |
| LANGFUSE_SECRET_KEY | Langfuse secret key | - |
| LANGFUSE_HOST | Langfuse server URL | - |
Deployment and sizing
Kubernetes configuration (15,000 pages/day)
Steady load

For predictable, consistent workloads:
| Setting | Value |
|---|---|
| Replicas | 3 |
| CPU requests | 3 cores |
| CPU limits | 4 cores |
| RAM requests | 4 Gi |
| RAM limits | 6 Gi |
| GUNICORN_WORKERS | 1 |
| OMP_NUM_THREADS | 4 |
Important deployment notes
- Set GUNICORN_TIMEOUT to at least 600 (10 minutes) — increase for very large documents
- Use 1 Gunicorn worker per pod — scale horizontally with more pods
- Configure liveness probes with a generous initial delay (at least 300 seconds)
- Consider using a RollingUpdate strategy with maxUnavailable: 0 to avoid downtime during deploys
- The service listens on port 8080
Cost considerations
Estimated monthly costs for a 15,000 pages/day workload using LLM-based parsing:

| Component | Monthly cost estimate |
|---|---|
| OpenAI API calls | ~$960 |
| Kubernetes compute (3 pods) | ~$150–500 |
| Total | ~$1,100–1,460 |
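A quick back-of-the-envelope check of the OpenAI figure above, assuming roughly 30 days per month. The per-page price is derived from the table, not quoted by any provider, so treat it as an order-of-magnitude estimate only.

```python
pages_per_day = 15_000
pages_per_month = pages_per_day * 30      # 450,000 pages/month

openai_monthly_usd = 960                  # estimate from the table above
cost_per_page = openai_monthly_usd / pages_per_month
# roughly $0.002 per page at this volume, before any cost-reduction tips
```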
Cost reduction tips
Use keyword-only classification

Set use_llm_to_classify to false in API calls instead of LLM-based classification. Saves ~$450/month.

Use Docling Auto

Let Docling Auto route pages to the cheapest engine that works. Saves ~$240/month compared to forcing OCR on all pages.

Turn off signature detection

Set detect_signature to false in API calls when not needed. Saves ~$90/month.

Use Classic for clean PDFs

Route clean digital PDFs to the Classic engine (free, no API calls). Reserve LLM/Docling for scanned or complex documents.
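The first three tips combine into a single set of per-request options. The flag names use_llm_to_classify and detect_signature are the API parameters quoted above; the parsing_engine field name is an assumption based on the API values table, and the savings are the additive estimates from this section.

```python
# Cost-conscious per-request options (field names partly hypothetical).
cheap_request_options = {
    "parsing_engine": "DoclingParserAuto",  # route each page to the cheapest engine
    "use_llm_to_classify": False,           # keyword-only classification
    "detect_signature": False,              # skip signature detection
}

# If all three tips apply to your workload:
estimated_monthly_savings_usd = 450 + 240 + 90
```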
Troubleshooting
OOM kills (out of memory)

Symptoms: Pods restart with OOMKilled status.

Solutions:
- Increase memory limits (try 8 Gi)
- Ensure GUNICORN_WORKERS is set to 1
- Reduce OMP_NUM_THREADS to 2
- Check for concurrent large document processing
Slow document processing

Symptoms: Documents take much longer than expected to process.

Solutions:
- Check if GPU is available for Docling VLM mode
- Verify network latency to the AI provider
- Use Docling Auto instead of forcing VLM on all documents
- Check pod CPU utilization — may need more replicas
Low extraction accuracy

Symptoms: Extracted text is incomplete or garbled.

Solutions:
- Try VLM or LLM mode for complex documents
- Check source document quality (low resolution images reduce OCR accuracy)
- Ensure the correct OCR engine is configured for your document language
- Use apply_rotation: true for scanned documents that may be rotated
Pod fails to start

Symptoms: Pod stays in CrashLoopBackOff or never becomes ready.

Solutions:
- Increase readiness probe initial delay to 600 seconds
- Check that ML model files are accessible (network or storage issues)
- Verify memory limits are sufficient for model loading (minimum 4 Gi)
- Review pod logs for specific error messages

