
# Document Parser setup

> Configure the Document Parser service for document text extraction, OCR processing, and AI-powered document parsing.

The Document Parser service handles document and image processing for AI workflows. It provides multiple parsing engines, OCR capabilities, table extraction, document classification, and signature detection.

## Dependencies

The Document Parser connects to standard FlowX infrastructure services that should already be configured in your environment:

* [**Kafka**](/5.9/setup-guides/kafka-authentication-config) — async document processing and inter-service communication
* [**Identity provider**](/5.9/setup-guides/access-management/configuring-an-iam-solution) (Keycloak or Azure AD) — authentication
* [**CMS Core**](/5.9/setup-guides/cms-setup) — file retrieval from FlowX content management
* **AI provider** (OpenAI, Azure OpenAI, Ollama, Anthropic, or GCP) — required for LLM and VLM parsing engines
* **Vector database** (Qdrant or PostgreSQL with pgvector) — optional, for RAG use cases

<Info>
  The Document Parser retrieves files through CMS Core — it does not require its own S3/MinIO storage configuration.
</Info>

<Info>
  The Document Parser is a Python 3.13 service that loads ML models at startup. Plan for longer startup times (5–20 minutes) compared to Java-based FlowX services.
</Info>

***

## Capabilities

### What it can do

* **Multi-format parsing**: PDF, DOCX, XLSX/XLS/XLSM, PPTX, JPG, PNG, TIFF
* **Three parsing engine families**: Classic, LLM, and Docling (with Auto/Simple/OCR/VLM modes)
* **OCR text extraction**: Tesseract, RapidOCR, EasyOCR
* **Table structure extraction**: Detect and extract tabular data
* **Document classification**: Keyword-based or LLM-based
* **Signature detection**: Locate signatures within documents
* **Page rotation detection**: Detect and correct rotated pages
* **Semantic chunking**: Smart chunking for Retrieval-Augmented Generation (RAG) workflows
* **Vector store integration**: Store embeddings in Qdrant or PostgreSQL/pgvector

### What it cannot do

* Cannot process encrypted or password-protected PDFs
* Cannot process audio or video files
* Cannot edit or modify source documents
* Does not perform language translation

***

## Supported formats

| Format     | Extensions               | Notes                           |
| ---------- | ------------------------ | ------------------------------- |
| PDF        | `.pdf`                   | All parsing engines supported   |
| Word       | `.docx`                  | Docling and Classic engines     |
| Excel      | `.xlsx`, `.xls`, `.xlsm` | Table extraction supported      |
| PowerPoint | `.pptx`                  | Slide-by-slide extraction       |
| Images     | `.jpg`, `.png`, `.tiff`  | Converted to PDF before parsing |

***

## Parsing engines

The Document Parser supports six parsing modes organized into three engine families:

| Engine         | API Value             | Best for                        | Speed  | Cost        | Accuracy   |
| -------------- | --------------------- | ------------------------------- | ------ | ----------- | ---------- |
| Classic        | `ClassicParser`       | Clean text PDFs                 | Fast   | Free        | Low–Medium |
| LLM            | `LLMParser`           | Any document via AI vision      | Slow   | High        | High       |
| Docling Simple | `DoclingParserSimple` | Standard business documents     | Medium | Low         | Medium     |
| Docling OCR    | `DoclingParserOCR`    | Scanned/image-heavy documents   | Slow   | Medium–High | High       |
| Docling VLM    | `DoclingParserVLM`    | Complex layouts                 | Slow   | High        | Highest    |
| Docling Auto   | `DoclingParserAuto`   | Unknown documents (auto-routes) | Varies | Varies      | Varies     |

<Info>
  **Docling Auto** analyzes each page using text density, gibberish ratio, image area coverage, and layout metrics. It then routes each page to the cheapest engine that can handle it: **Standard** (simple extraction), **OCR** (text-poor/garbled pages), or **VLM** (complex visual layouts).
</Info>
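
The per-page routing described above can be sketched roughly as follows. This is an illustration only, not the service's actual implementation: the `PageMetrics` fields mirror the signals named in the note, but the field names, the `route_page` function, and every threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageMetrics:
    """Per-page signals; field names are illustrative, not the service's API."""
    text_density: float       # normalized amount of extractable text, 0..1
    gibberish_ratio: float    # share of extracted tokens that look garbled, 0..1
    image_coverage: float     # share of the page area covered by images, 0..1
    layout_complexity: float  # hypothetical 0..1 score for columns, tables, figures


def route_page(m: PageMetrics) -> str:
    """Pick the cheapest engine that can plausibly handle the page.

    All thresholds are made-up placeholders for illustration.
    """
    # Complex visual layouts go to the most capable (and most expensive) engine.
    if m.layout_complexity > 0.7:
        return "VLM"
    # Text-poor, garbled, or image-dominated pages need OCR.
    if m.text_density < 0.2 or m.gibberish_ratio > 0.3 or m.image_coverage > 0.6:
        return "OCR"
    # Everything else has a usable text layer.
    return "Standard"
```

Because the decision is made per page, a single document can mix engines: a scanned appendix is routed to OCR while the digitally generated body stays on the cheaper Standard path.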

### Engine selection guide

<Tabs>
  <Tab title="Classic">
    Uses PyMuPDF for direct text extraction from the PDF file structure, with Tesseract OCR as a fallback for empty pages.

    * Zero cost (no external API calls)
    * Fastest processing
    * Direct extraction only works on digital PDFs that have selectable text
    * Falls back to Tesseract OCR when pages have no text layer
  </Tab>

  <Tab title="LLM">
    Encodes each page as a base64 image and sends it to an AI vision model for extraction.

    * High accuracy, including on complex documents
    * Most expensive (AI API call per page)
    * Handles any document type including handwritten text
    * Supports multiple providers: OpenAI, Azure OpenAI, Ollama, Anthropic, GCP
  </Tab>

  <Tab title="Docling">
    IBM's open-source document processing library with multiple modes:

    * **Simple** — Layout analysis without OCR, good for standard business documents
    * **OCR** — Adds OCR processing for scanned documents
    * **VLM** — Uses vision-language models for the highest accuracy on complex layouts
    * **Auto** — Automatically selects the best mode per page based on content analysis
  </Tab>
</Tabs>
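
The first step of the LLM engine, turning a page image into a base64 payload, can be sketched with the Python standard library. The `to_data_url` helper and the data-URL wrapping are assumptions for illustration; the request format the service actually sends to each provider is not documented here.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Stand-in bytes (the PNG file signature); a real page would be rendered
# to PNG or JPEG before encoding.
page_png = b"\x89PNG\r\n\x1a\n"
payload = to_data_url(page_png)
```

Base64 inflates the payload by about one third, which is one reason per-page vision calls are slower and more expensive than direct text extraction.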

***

## API endpoints

All endpoints are prefixed with the `URL_PREFIX` value (default: `/doc-parser`).

| Endpoint                                  | Method | Purpose                                                 |
| ----------------------------------------- | ------ | ------------------------------------------------------- |
| `/api/v1/doc-parser/extract/parse_file`   | POST   | Parse an uploaded file (multipart form)                 |
| `/api/v1/doc-parser/extract/parse_object` | POST   | Parse a file from CMS or MinIO (JSON body)              |
| `/api/v1/doc-parser/extract/store_text`   | POST   | Store extracted text as embeddings in a vector database |
| `/api/v1/doc-parser/info/health`          | GET    | Health check                                            |

***

## Configuration

### Server configuration

| Environment Variable        | Description                                 | Default Value |
| --------------------------- | ------------------------------------------- | ------------- |
| `URL_PREFIX`                | FastAPI root path for reverse proxy routing | `/doc-parser` |
| `GUNICORN_WORKERS`          | Number of Gunicorn worker processes         | `1`           |
| `GUNICORN_TIMEOUT`          | Worker timeout in seconds                   | `600`         |
| `GUNICORN_GRACEFUL_TIMEOUT` | Graceful shutdown timeout in seconds        | `320`         |
| `USE_DOCLING`               | Pre-download Docling models at startup      | `0`           |
| `VERBOSE`                   | Enable verbose/debug logging                | `0`           |

<Warning>
  Keep `GUNICORN_WORKERS` at `1` per pod. The Document Parser loads large ML models into memory — multiple workers per pod will cause out-of-memory (OOM) errors. Scale horizontally by adding more pods instead.
</Warning>

<Tip>
  Set `USE_DOCLING` to `1` if you use Docling parsing engines. This pre-downloads models at startup rather than on first request, avoiding timeout issues on the first document.
</Tip>
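
A small pre-deployment check can catch the two settings called out above. This is a minimal sketch: the `check_server_env` function is hypothetical and is not shipped with the service; only the variable names and defaults come from the table.

```python
from typing import Mapping


def check_server_env(env: Mapping[str, str], uses_docling: bool) -> list[str]:
    """Return human-readable warnings for risky server settings."""
    warnings = []
    # Defaults mirror the server configuration table.
    if int(env.get("GUNICORN_WORKERS", "1")) != 1:
        warnings.append("GUNICORN_WORKERS should stay at 1 per pod; scale with more pods")
    if uses_docling and env.get("USE_DOCLING", "0") != "1":
        warnings.append("USE_DOCLING=1 pre-downloads Docling models at startup")
    return warnings
```

Calling `check_server_env(os.environ, uses_docling=True)` in a CI step or an init script returns an empty list when both settings follow the recommendations.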

***

### AI model configuration

The Document Parser uses three types of AI models, each independently configurable:

| Model type | Variable               | Purpose                                                  | Default  |
| ---------- | ---------------------- | -------------------------------------------------------- | -------- |
| Chat       | `MODEL_TYPE`           | LLM-based parsing, classification, structured extraction | `OPENAI` |
| Vision     | `VISION_MODEL_TYPE`    | VLM parsing, image analysis                              | `OPENAI` |
| Embedding  | `EMBEDDING_MODEL_TYPE` | Vector store embeddings, semantic chunking               | `OPENAI` |

| Environment Variable       | Description                            | Default Value |
| -------------------------- | -------------------------------------- | ------------- |
| `FALLBACK_MODEL_TYPE`      | Fallback LLM provider if primary fails | -             |
| `STRUCTURED_OUTPUT_METHOD` | Structured output method for LLM       | `json_schema` |
| `VLM_API_TIMEOUT`          | Timeout in seconds for VLM API calls   | `300`         |

<Info>
  All three model types support the same providers: `OPENAI`, `AZUREOPENAI`, `OLLAMA`, `ANTHROPIC`, `GCP`, `GCP_VERTEX`, `CUSTOM_OPENAI`, `OPENAI_COMPATIBLE`. Embedding models additionally support `HUGGINGFACE` and `AWS`.
</Info>
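
The provider rules above can be expressed as a small validation table. The sketch below is illustrative: the `invalid_model_types` function is hypothetical, and exact, case-sensitive matching of provider names is an assumption.

```python
# Provider names copied from the note above.
COMMON_PROVIDERS = {
    "OPENAI", "AZUREOPENAI", "OLLAMA", "ANTHROPIC",
    "GCP", "GCP_VERTEX", "CUSTOM_OPENAI", "OPENAI_COMPATIBLE",
}
EMBEDDING_ONLY_PROVIDERS = {"HUGGINGFACE", "AWS"}

ALLOWED = {
    "MODEL_TYPE": COMMON_PROVIDERS,
    "VISION_MODEL_TYPE": COMMON_PROVIDERS,
    "EMBEDDING_MODEL_TYPE": COMMON_PROVIDERS | EMBEDDING_ONLY_PROVIDERS,
}


def invalid_model_types(env: dict[str, str]) -> dict[str, str]:
    """Return {variable: value} for each model type set to an unsupported provider."""
    bad = {}
    for variable, allowed in ALLOWED.items():
        value = env.get(variable, "OPENAI")  # documented default
        if value not in allowed:
            bad[variable] = value
    return bad
```

For example, `HUGGINGFACE` passes for `EMBEDDING_MODEL_TYPE` but is reported as invalid for `MODEL_TYPE`.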

#### Provider-specific variables

<Tabs>
  <Tab title="OpenAI">
    | Environment Variable          | Description          | Default Value            |
    | ----------------------------- | -------------------- | ------------------------ |
    | `OPENAI_API_KEY`              | OpenAI API key       | -                        |
    | `OPENAI_MODEL_NAME`           | Chat model name      | `gpt-4o-2024-08-06`      |
    | `OPENAI_VISION_MODEL_NAME`    | Vision model name    | `gpt-4o-2024-08-06`      |
    | `OPENAI_EMBEDDING_MODEL_NAME` | Embedding model name | `text-embedding-3-large` |
    | `OPENAI_CHUNK_SIZE`           | Embedding chunk size | `1000`                   |
  </Tab>

  <Tab title="Azure OpenAI">
    | Environment Variable               | Description                 | Default Value            |
    | ---------------------------------- | --------------------------- | ------------------------ |
    | `AZURE_OPENAI_API_KEY`             | Azure OpenAI API key        | -                        |
    | `AZURE_OPENAI_INSTANCE_NAME`       | Azure OpenAI instance name  | -                        |
    | `AZURE_OPENAI_API_VERSION`         | Azure OpenAI API version    | `2025-01-01-preview`     |
    | `AZURE_OPENAI_MODEL`               | Azure model deployment name | `gpt-4.1-mini`           |
    | `AZUREOPENAI_MODEL_NAME`           | Chat model name             | `gpt-4o-2024-08-06`      |
    | `AZUREOPENAI_VISION_MODEL_NAME`    | Vision model name           | `gpt-4o-2024-08-06`      |
    | `AZUREOPENAI_EMBEDDING_MODEL_NAME` | Embedding model name        | `text-embedding-3-large` |
  </Tab>

  <Tab title="Ollama">
    | Environment Variable          | Description          | Default Value     |
    | ----------------------------- | -------------------- | ----------------- |
    | `OLLAMA_BASE_URL`             | Ollama server URL    | -                 |
    | `OLLAMA_MODEL_NAME`           | Chat model name      | `llama3.2:latest` |
    | `OLLAMA_VISION_MODEL_NAME`    | Vision model name    | `llava:7b-v1.6`   |
    | `OLLAMA_EMBEDDING_MODEL_NAME` | Embedding model name | `all-minilm:33m`  |
  </Tab>

  <Tab title="Other providers">
    **Anthropic:**

    | Environment Variable          | Description       | Default Value |
    | ----------------------------- | ----------------- | ------------- |
    | `ANTHROPIC_MODEL_NAME`        | Chat model name   | -             |
    | `ANTHROPIC_VISION_MODEL_NAME` | Vision model name | -             |

    **Google Cloud / Vertex AI:**

    | Environment Variable    | Description             | Default Value    |
    | ----------------------- | ----------------------- | ---------------- |
    | `GCP_PROJECT_ID`        | Google Cloud project ID | -                |
    | `GCP_LOCATION`          | Google Cloud region     | -                |
    | `GCP_MODEL_NAME`        | Chat model name         | `gemini-1.5-pro` |
    | `GCP_VISION_MODEL_NAME` | Vision model name       | `gemini-1.5-pro` |

    **Custom OpenAI-compatible:**

    | Environment Variable                 | Description          | Default Value            |
    | ------------------------------------ | -------------------- | ------------------------ |
    | `CUSTOM_OPENAI_BASE_URL`             | API base URL         | -                        |
    | `CUSTOM_OPENAI_CLIENT_ID`            | Client ID            | -                        |
    | `CUSTOM_OPENAI_SECRET_KEY`           | Secret key           | -                        |
    | `CUSTOM_OPENAI_MODEL_NAME`           | Chat model name      | `gpt-4o-2024-08-06`      |
    | `CUSTOM_OPENAI_VISION_MODEL_NAME`    | Vision model name    | `gpt-4o-2024-08-06`      |
    | `CUSTOM_OPENAI_EMBEDDING_MODEL_NAME` | Embedding model name | `text-embedding-3-large` |
  </Tab>
</Tabs>

***

### OCR configuration

| Environment Variable | Description                          | Default Value |
| -------------------- | ------------------------------------ | ------------- |
| `OMP_NUM_THREADS`    | Number of threads for OCR processing | `4`           |

<Info>
  The OCR engine is selected per-request via the API `ocr_engine` parameter. Available engines: `tesseract-cli` (default), `tesseract-ocr`, `rapidocr`, `easyocr`.
</Info>

***

### Vector database configuration (optional)

Configure a vector database only if you use semantic chunking or RAG workflows. The vector store provider is selected per-request via the API `vector_store_provider` parameter.

<Tabs>
  <Tab title="Qdrant">
    | Environment Variable | Description       | Default Value           |
    | -------------------- | ----------------- | ----------------------- |
    | `QDRANT_HOST`        | Qdrant server URL | `http://localhost:6333` |
  </Tab>

  <Tab title="PostgreSQL / pgvector">
    | Environment Variable | Description       | Default Value |
    | -------------------- | ----------------- | ------------- |
    | `POSTGRES_HOST`      | PostgreSQL host   | `localhost`   |
    | `POSTGRES_PORT`      | PostgreSQL port   | `5432`        |
    | `POSTGRES_DB`        | Database name     | `dip`         |
    | `POSTGRES_USER`      | Database username | -             |
    | `POSTGRES_PASSWORD`  | Database password | -             |
  </Tab>
</Tabs>
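
To test connectivity to the pgvector database from a debug pod, the PostgreSQL variables can be combined into a standard connection URI. This is a sketch under assumptions: the `postgres_dsn` helper is hypothetical, and how the service itself builds its connection is not documented here.

```python
from urllib.parse import quote


def postgres_dsn(env: dict[str, str]) -> str:
    """Build a PostgreSQL connection URI from the variables in the table above."""
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    database = env.get("POSTGRES_DB", "dip")
    # Credentials have no documented default, so a missing one raises KeyError.
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    # Percent-encoding keeps characters such as "@" or "/" valid inside the URI.
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

The resulting string can be passed to any client that accepts PostgreSQL connection URIs.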

***

### Observability (optional)

#### Langfuse

| Environment Variable  | Description         | Default Value |
| --------------------- | ------------------- | ------------- |
| `LANGFUSE_PUBLIC_KEY` | Langfuse public key | -             |
| `LANGFUSE_SECRET_KEY` | Langfuse secret key | -             |
| `LANGFUSE_HOST`       | Langfuse server URL | -             |

***

## Deployment and sizing

### Kubernetes configuration (15,000 pages/day)

<Tabs>
  <Tab title="Steady load">
    For predictable, consistent workloads:

    | Setting            | Value   |
    | ------------------ | ------- |
    | Replicas           | 3       |
    | CPU requests       | 3 cores |
    | CPU limits         | 4 cores |
    | RAM requests       | 4 Gi    |
    | RAM limits         | 6 Gi    |
    | `GUNICORN_WORKERS` | `1`     |
    | `OMP_NUM_THREADS`  | `4`     |
  </Tab>

  <Tab title="Autoscaling">
    For variable workloads with burst capacity:

    | Setting            | Value   |
    | ------------------ | ------- |
    | Min replicas       | 2       |
    | Max replicas       | 7       |
    | CPU requests       | 3 cores |
    | CPU limits         | 4 cores |
    | RAM requests       | 4 Gi    |
    | RAM limits         | 6 Gi    |
    | `GUNICORN_WORKERS` | `1`     |
    | `OMP_NUM_THREADS`  | `4`     |
    | HPA target CPU     | 70%     |
  </Tab>
</Tabs>
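
To adapt these numbers to a different volume, it helps to work out how much time each pod can spend on a page. The sketch below assumes that load is spread evenly across replicas and that each pod handles one page at a time; both are simplifications, and the `seconds_per_page_budget` function is hypothetical.

```python
def seconds_per_page_budget(pages_per_day: int, replicas: int,
                            active_hours: float = 24.0) -> float:
    """Average time one pod can spend per page while keeping up with the daily volume."""
    active_seconds = active_hours * 3600
    return active_seconds * replicas / pages_per_day


# 15,000 pages/day on 3 replicas, spread over a full day.
print(round(seconds_per_page_budget(15_000, 3), 2))     # 17.28
# The same volume arriving within an 8-hour working day.
print(round(seconds_per_page_budget(15_000, 3, 8), 2))  # 5.76
```

If the parsing engine you use needs more time per page than this budget, add replicas or raise the autoscaling maximum.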

### Important deployment notes

<Warning>
  **Pod startup time:** 5–20 minutes. The service loads ML models (OCR, Docling) into memory at startup. Configure your readiness probes and deployment strategy accordingly.
</Warning>
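
Kubernetes tolerates roughly `initialDelaySeconds + failureThreshold × periodSeconds` of failed checks before it acts on a probe, so the startup window translates into probe settings with simple arithmetic. The sketch below is illustrative: the `startup_failure_threshold` function is hypothetical and the 30-second period is an arbitrary example value.

```python
import math


def startup_failure_threshold(max_startup_seconds: int, period_seconds: int = 30,
                              initial_delay_seconds: int = 0) -> int:
    """Smallest failureThreshold that tolerates the full startup window."""
    remaining = max(max_startup_seconds - initial_delay_seconds, 0)
    return max(math.ceil(remaining / period_seconds), 1)


# 20 minutes is the upper end of the startup range quoted above.
print(startup_failure_threshold(20 * 60))                             # 40
print(startup_failure_threshold(20 * 60, initial_delay_seconds=300))  # 30
```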

* Set `GUNICORN_TIMEOUT` to at least `600` (10 minutes) — increase for very large documents
* Use `1` Gunicorn worker per pod — scale horizontally with more pods
* Configure liveness probes with a generous initial delay (at least 300 seconds)
* Consider using a `RollingUpdate` strategy with `maxUnavailable: 0` to avoid downtime during deploys
* The service listens on port `8080`

***

## Cost considerations

Estimated monthly costs for a 15,000 pages/day workload using LLM-based parsing:

| Component                   | Monthly cost estimate |
| --------------------------- | --------------------- |
| OpenAI API calls            | \~\$960               |
| Kubernetes compute (3 pods) | \~\$150–500           |
| **Total**                   | **\~\$1,110–1,460**   |

### Cost reduction tips

<CardGroup cols={2}>
  <Card title="Use keyword-only classification" icon="tag">
    Set `use_llm_to_classify` to `false` in API calls to use keyword-based classification instead of LLM-based classification. Saves \~\$450/month.
  </Card>

  <Card title="Use Docling Auto" icon="wand-magic-sparkles">
    Let Docling Auto route pages to the cheapest engine that works. Saves \~\$240/month compared to forcing OCR on all pages.
  </Card>

  <Card title="Turn off signature detection" icon="toggle-off">
    Set `detect_signature` to `false` in API calls when not needed. Saves \~\$90/month.
  </Card>

  <Card title="Use Classic for clean PDFs" icon="bolt">
    Route clean digital PDFs to the Classic engine (free, no API calls). Reserve LLM/Docling for scanned or complex documents.
  </Card>
</CardGroup>
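
The estimate above can be reproduced and rescaled with a few lines of arithmetic. The figures come from the cost table; the 30-day month and the function names are assumptions made for this sketch.

```python
DAYS_PER_MONTH = 30  # assumption; the table does not state the month length used

API_COST = 960                        # OpenAI API calls, USD per month
COMPUTE_LOW, COMPUTE_HIGH = 150, 500  # Kubernetes compute (3 pods), USD per month


def monthly_total(api: float, compute: float) -> float:
    """Sum of API and compute cost for one month."""
    return api + compute


def api_cost_per_page(api: float, pages_per_day: int,
                      days: int = DAYS_PER_MONTH) -> float:
    """Average API cost of a single page."""
    return api / (pages_per_day * days)


print(monthly_total(API_COST, COMPUTE_LOW))           # 1110
print(monthly_total(API_COST, COMPUTE_HIGH))          # 1460
print(round(api_cost_per_page(API_COST, 15_000), 4))  # 0.0021
```

The per-page figure gives a rough way to project API spend for other volumes, provided the mix of engines and options stays the same.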

***

## Verify your setup

<Check>
  The Document Parser pod is running: `kubectl get pods -l app=document-parser`
</Check>

<Check>
  The health endpoint returns HTTP 200: `curl http://document-parser:8080/api/v1/doc-parser/info/health`
</Check>

<Check>
  ML models loaded successfully — check pod logs for model initialization messages (this can take 5–20 minutes)
</Check>

<Check>
  Test a document parse request against the `/api/v1/doc-parser/extract/parse_file` endpoint and verify the response contains extracted text
</Check>
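
Because model loading can take many minutes, a script that polls the health endpoint is often more convenient than repeating `curl` by hand. The sketch below uses only the standard library; the function names, the 15-second interval, and the 20-minute default timeout are assumptions, while the path, host, and port come from the check above.

```python
import time
import urllib.request


def health_url(base_url: str) -> str:
    """Join a base URL with the documented health path."""
    return base_url.rstrip("/") + "/api/v1/doc-parser/info/health"


def wait_until_healthy(base_url: str, timeout_seconds: int = 1200,
                       interval_seconds: int = 15) -> bool:
    """Poll the health endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(health_url(base_url), timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers URLError, HTTPError, and socket timeouts:
            # the service is not ready yet, so keep waiting.
            pass
        time.sleep(interval_seconds)
    return False
```

For example, `wait_until_healthy("http://document-parser:8080")` returns `True` as soon as the service responds and `False` if it is still unavailable after 20 minutes.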

***

## Troubleshooting

<AccordionGroup>
  <Accordion title="OOM kills (out of memory)">
    **Symptoms:** Pods restart with OOMKilled status.

    **Solutions:**

    * Increase memory limits (try 8 Gi)
    * Ensure `GUNICORN_WORKERS` is set to `1`
    * Reduce `OMP_NUM_THREADS` to `2`
    * Check whether several large documents are being processed concurrently
  </Accordion>

  <Accordion title="Slow document processing">
    **Symptoms:** Documents take much longer than expected to process.

    **Solutions:**

    * Check whether a GPU is available for Docling VLM mode
    * Verify network latency to the AI provider
    * Use Docling Auto instead of forcing VLM on all documents
    * Check pod CPU utilization — may need more replicas
  </Accordion>

  <Accordion title="Low extraction accuracy">
    **Symptoms:** Extracted text is incomplete or garbled.

    **Solutions:**

    * Try VLM or LLM mode for complex documents
    * Check source document quality (low-resolution images reduce OCR accuracy)
    * Ensure the OCR engine selected via the `ocr_engine` parameter supports your document language
    * Use `apply_rotation: true` for scanned documents that may be rotated
  </Accordion>

  <Accordion title="Pod fails to start">
    **Symptoms:** Pod stays in CrashLoopBackOff or never becomes ready.

    **Solutions:**

    * Increase readiness probe initial delay to 600 seconds
    * Check that ML model files are accessible (network or storage issues)
    * Verify memory limits are sufficient for model loading (minimum 4 Gi)
    * Review pod logs for specific error messages
  </Accordion>
</AccordionGroup>

***

## Related resources

<CardGroup cols={2}>
  <Card title="Extract Data from File" icon="file-lines" href="/5.9/ai-platform/agent-builder/extract-data-from-file">
    Configure the Extract Data from File AI node in Agent Builder
  </Card>

  <Card title="AI Platform setup" icon="server" href="/5.9/ai-platform/ai-platform-setup">
    Configure AI Platform services and service discovery
  </Card>

  <Card title="Agent Builder overview" icon="robot" href="/5.9/ai-platform/agent-builder/overview">
    Get started with Agent Builder workflows
  </Card>
</CardGroup>