# Evaluations

> Score AI node outputs against test cases using LLM-as-judge, RAG-quality, and code evaluators. Covers the evaluator catalogue, datasets, experiments, and the `evals-judge` service.

<Info>
  Attach test cases to AI nodes in a workflow, pick from a catalogue of evaluators (correctness, hallucination, RAG groundedness, JSON match, and more), launch an experiment, and read per-evaluator aggregate scores plus per-test-case details. Powered by the async `evals-judge` service.
</Info>

## Overview

An *evaluation* (eval) scores the output of an AI node against a known expected outcome. Use evals to:

* **Tune prompts and retrieval** — change a system prompt or KB chunk size, rerun the same experiment, and see whether your scores improved.
* **Regression-guard agents** — before promoting a new agent version, run the dataset of "hard" past inputs and compare scores against the previous version.
* **Pick a model** — run the same dataset against two different LLMs and compare correctness, conciseness, and cost.

Evals are workflow-centric in 5.9.0. Each dataset is attached to a specific AI node in a specific workflow; experiments launch that workflow against the dataset and score the node's outputs.

***

## Core concepts

| Concept                | What it is                                                                                                                                                                   |
| ---------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Dataset**            | A named collection of test cases attached to a specific AI node in a workflow. Holds the node configuration snapshot at the time the dataset was last updated.               |
| **Test case**          | One row of input data plus an expected output. May include retrieved context. Sourced either **manually** or captured from a real **node execution** in a workflow run.      |
| **Evaluator (scorer)** | A pre-seeded rule that scores a model output against the expected output, the retrieved context, or a rubric. Belongs to one of three categories: LLM-as-judge, RAG quality, or code. |
| **Experiment**         | A run of the workflow against every test case in one or more datasets, scored by one or more evaluators. Produces aggregate scores per evaluator plus per-test-case details. |
| **Score**              | Numeric (typically 0.0–1.0) or binary, depending on the evaluator. Numeric scorers have a default pass/fail threshold.                                                       |
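
The relationships in the table can be sketched as a small data model. This is an illustration only: the class names, field names, category identifiers, and the `MANUAL` source label are assumptions, not the platform's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EvalTestCase:
    """One row of input plus the expected output and optional context."""
    input: Any
    expected_output: Optional[Any] = None
    context: list[str] = field(default_factory=list)
    source: str = "MANUAL"  # assumed label; captured cases use NODE_EXECUTION


@dataclass
class Dataset:
    """Named collection of test cases attached to one AI node."""
    name: str
    node_uuid: str
    test_cases: list[EvalTestCase] = field(default_factory=list)


@dataclass
class Evaluator:
    """Catalogue entry; default_threshold stays None for binary evaluators."""
    name: str
    category: str  # hypothetical identifiers: LLM_AS_JUDGE, RAG_QUALITY, CODE
    default_threshold: Optional[float] = None


@dataclass
class Experiment:
    """Datasets and evaluators bound together for one run."""
    datasets: list[Dataset]
    evaluators: list[Evaluator]


baseline = Dataset(name="Customer-intent-classification-baseline", node_uuid="node-123")
baseline.test_cases.append(
    EvalTestCase(input={"message": "I want my money back"}, expected_output="refund")
)
experiment = Experiment(
    datasets=[baseline],
    evaluators=[Evaluator("Exact match", "CODE"), Evaluator("Correctness", "LLM_AS_JUDGE", 0.7)],
)
```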

***

## Get started

<Steps>
  <Step title="Add an AI node to a workflow">
    Drop a Custom Agent, Intent Classification, Extract Data, or any other AI node into a workflow and configure it as usual.
  </Step>

  <Step title="Create a dataset on the node">
    From the node's side panel, open **Evaluations → Datasets** and click **New dataset**. Name it after what it tests (e.g. *Customer-intent-classification-baseline*).
  </Step>

  <Step title="Add test cases">
    Add at least one test case manually, or capture one from an existing process instance run (see [Add test cases](#add-test-cases)).
  </Step>

  <Step title="Pick evaluators">
    Open **Evaluations → Experiments → New experiment**. Choose one or more datasets and pick the evaluators you want to score against from the catalogue.
  </Step>

  <Step title="Launch the experiment">
    Click **Start**. The workflow runs against every test case; the `evals-judge` service scores each (test case × evaluator) pair asynchronously.
  </Step>

  <Step title="Read the results">
    Once the experiment moves to `PASSED` or `FAILED`, open the report for aggregate scores per evaluator and a list of test-case rows you can drill into.
  </Step>
</Steps>

***

## Create a dataset

A dataset belongs to a single AI node on a single workflow. Two datasets on the same node must have unique names.

Fields:

| Field                     | Description                                                                                  |
| ------------------------- | -------------------------------------------------------------------------------------------- |
| **Name**                  | Human-readable label, unique per node.                                                       |
| **Description**           | Optional. Useful for explaining "what is this dataset testing?"                              |
| **Multimodal**            | Toggle on if test cases include image or file inputs as well as text.                        |
| **Output shape contract** | Optional JSON schema describing the expected output shape. Used by the JSON Match evaluator. |

The dataset persists a *schema snapshot* of the node's input/output at the time of last edit. If the workflow later changes underneath an existing dataset, the experiment runs against the snapshot rather than the live node, so older runs remain reproducible.

***

## Add test cases

A test case has three core fields:

* **Input** — the data the AI node receives.
* **Expected output** — what a correct response looks like. Required by exact-match and JSON-match evaluators; optional but recommended for LLM-as-judge.
* **Context** — optional retrieved context (KB chunks, retrieval results) used to score RAG evaluators.

Two ways to add them:

<Tabs>
  <Tab title="Manual">
    From the dataset's **Test cases** tab, click **Add test case**. Fill in **Input**, **Expected output**, and optional **Context**. Save.

    Use this for synthetic or hand-crafted cases — typical edge cases, known-difficult inputs, or cases borrowed from a spec or QA tracker.
  </Tab>

  <Tab title="From a node execution">
    Open a real process instance in the runtime, find the AI node execution you want to capture, and choose **Add to dataset**. FlowX serialises the node's input, captured output, and any retrieved context into a new test case on the chosen dataset.

    Use this to harvest "interesting" production cases (high-uncertainty completions, ones flagged by an operator) into a regression set.

    Captured cases are marked with `source: NODE_EXECUTION` and carry a back-reference to the originating execution.
  </Tab>
</Tabs>
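
Before launching an experiment, it can help to check that each test case carries the fields the chosen evaluators rely on. The sketch below follows the field requirements described on this page; the evaluator identifiers, the dictionary keys, and the `missing_fields` helper are hypothetical.

```python
# Evaluators that compare against the expected output (hypothetical identifiers).
NEEDS_EXPECTED = {"exact_match", "levenshtein", "json_match", "correctness"}
# RAG evaluators that score the retrieved context.
NEEDS_CONTEXT = {"groundedness", "helpfulness", "retrieval_relevance"}


def missing_fields(test_case: dict, evaluators: list[str]) -> dict[str, list[str]]:
    """Return, per evaluator, the test-case fields it needs but cannot find."""
    problems: dict[str, list[str]] = {}
    for name in evaluators:
        missing = []
        if name in NEEDS_EXPECTED and test_case.get("expected_output") is None:
            missing.append("expected_output")
        if name in NEEDS_CONTEXT and not test_case.get("context"):
            missing.append("context")
        if missing:
            problems[name] = missing
    return problems


case = {"input": {"question": "What is the refund window?"}}
report = missing_fields(case, ["exact_match", "groundedness", "conciseness"])
```

Here `report` flags `exact_match` (no expected output) and `groundedness` (no context), while `conciseness` needs neither field.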

***

## Launch an experiment

An experiment binds together:

* One or more datasets (all must be on the same AI node)
* One or more evaluators
* A node configuration snapshot (frozen at launch time)

When you click **Start**, the workflow runs end-to-end against every test case. Each output produces one judge job per (test case × evaluator) pair, published to the `ai.flowx.ai-platform.evals-judge.job.request.v1` Kafka topic. `evals-judge` resolves the configured judge LLM, runs the scoring prompt, and returns a score and reasoning on `ai.flowx.ai-platform.evals-judge.job.response.v1`.
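
The fan-out into judge jobs is a Cartesian product of test cases and evaluators. A minimal sketch, assuming a hypothetical job shape (the real message schema is not documented here):

```python
from itertools import product


def build_judge_jobs(experiment_id: str, test_case_ids: list[str], evaluator_ids: list[str]) -> list[dict]:
    """Create one job per (test case x evaluator) pair. Field names are illustrative."""
    return [
        {"experimentId": experiment_id, "testCaseId": tc, "evaluatorId": ev}
        for tc, ev in product(test_case_ids, evaluator_ids)
    ]


jobs = build_judge_jobs("exp-1", ["tc-1", "tc-2", "tc-3"], ["correctness", "exact_match"])
```

Three test cases and two evaluators yield six jobs, so the cost of an experiment grows with the product of the two counts rather than their sum.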

Experiment status moves through:

| Status      | Meaning                                                                                                     |
| ----------- | ----------------------------------------------------------------------------------------------------------- |
| `STARTED`   | Experiment accepted, workflows are being scheduled.                                                         |
| `RUNNING`   | At least one (test case × evaluator) pair is still scoring.                                                 |
| `PASSED`    | All scoring jobs completed; overall score is above the pass threshold for the experiment.                   |
| `FAILED`    | All scoring jobs completed; overall score is below threshold (or a non-trivial number of jobs hit the DLQ). |
| `CANCELLED` | Cancelled by a user before completion.                                                                      |

The Designer streams progress updates as scoring jobs complete, so the report fills in live without a full reload.
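
If you drive experiments from your own tooling rather than the Designer, you can poll until the experiment reaches a terminal status. The sketch below assumes a caller-supplied `fetch_status` function that returns the status string; the helper name, polling interval, and timeout are illustrative choices.

```python
import time
from enum import Enum


class ExperimentStatus(Enum):
    STARTED = "STARTED"
    RUNNING = "RUNNING"
    PASSED = "PASSED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


# Statuses after which no further scoring happens.
TERMINAL = {ExperimentStatus.PASSED, ExperimentStatus.FAILED, ExperimentStatus.CANCELLED}


def wait_for_completion(fetch_status, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll fetch_status() until a terminal status is seen or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = ExperimentStatus(fetch_status())
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("experiment did not finish in time")
        sleep(poll_seconds)
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.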

***

## View results

Each experiment produces a **report** with two views:

### Aggregate scores

A row per evaluator with the mean score (or the pass rate, for binary evaluators) across all test cases. Useful for "is this prompt better than the last one?" comparisons.
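
As a rough illustration of how per-evaluator aggregates could be derived from raw scores, the sketch below averages numeric scores and computes a pass rate for boolean ones. The input shape and the `aggregate` helper are assumptions, not the report's actual implementation.

```python
from statistics import mean


def aggregate(scores_by_evaluator: dict[str, list]) -> dict[str, float]:
    """Mean for numeric scores, pass rate for boolean scores; empty lists are skipped."""
    summary = {}
    for name, scores in scores_by_evaluator.items():
        if not scores:
            continue
        if all(isinstance(s, bool) for s in scores):
            summary[name] = sum(scores) / len(scores)  # share of passing cases
        else:
            summary[name] = mean(scores)
    return summary


summary = aggregate({
    "correctness": [0.5, 1.0],
    "exact_match": [True, False, True, True],
})
```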

### Per-test-case details

A row per test case showing:

* The input, the model's actual output, and the expected output side-by-side.
* One score column per evaluator selected for the experiment.
* For LLM-as-judge evaluators, the judge's reasoning text (the model's own explanation of the score).
* Pass/fail badges against each evaluator's threshold.

Click into a row to see the full prompt, response, and any retrieved context — this is the screen you use to debug "why is this case failing?"

***

## Evaluator catalogue

5.9.0 ships **10 built-in evaluators** in three categories. The catalogue is read-only — user-defined evaluators are planned for a later release.

### LLM-as-judge

A judge LLM reads the model output (and optional expected output) and scores it on a single rubric. Score is numeric 0.0–1.0.

| Evaluator            | What it scores                                                                    | Needs expected output? |
| -------------------- | --------------------------------------------------------------------------------- | ---------------------- |
| **Correctness**      | How closely the output matches the expected meaning.                              | Yes                    |
| **Conciseness**      | Whether the output is appropriately brief — penalises padding and rambling.       | No                     |
| **Hallucination**    | Whether the output contains claims not grounded in the input or provided context. | No                     |
| **Answer relevance** | Whether the output actually answers what the input asked.                         | No                     |

### RAG quality

Evaluators that judge how a RAG workflow retrieves context and grounds its output in it. Score is numeric 0.0–1.0.

| Evaluator               | What it scores                                                                    |
| ----------------------- | --------------------------------------------------------------------------------- |
| **Groundedness**        | Is every claim in the output supported by the retrieved context?                  |
| **Helpfulness**         | Does the retrieved context actually help answer the question, or is it off-topic? |
| **Retrieval relevance** | Are the top retrieved chunks the most relevant to the question?                   |

### Code

Deterministic, no LLM call. Useful for structured outputs.

| Evaluator       | What it scores                                                            | Score type |
| --------------- | ------------------------------------------------------------------------- | ---------- |
| **Exact match** | Output equals expected, character-for-character.                          | Binary     |
| **Levenshtein** | Edit-distance similarity between output and expected (normalised).        | Numeric    |
| **JSON match**  | Output is valid JSON and conforms to the dataset's output-shape contract. | Binary     |
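
To make the code evaluators concrete, here is a minimal sketch of the three checks. It rests on assumptions: the similarity is normalised by the longer string's length (the platform's exact normalisation is not specified), and `json_match` uses a required-keys check as a simplified stand-in for validating against the output-shape contract.

```python
import json


def exact_match(output: str, expected: str) -> bool:
    """Character-for-character equality."""
    return output == expected


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def levenshtein_similarity(output: str, expected: str) -> float:
    """1.0 for identical strings, 0.0 when every character differs (assumed normalisation)."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(output, expected) / longest


def json_match(output: str, required_keys: set[str]) -> bool:
    """Simplified stand-in: valid JSON object that contains every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)
```

For example, `levenshtein_distance("kitten", "sitting")` is 3, so under this normalisation the similarity is `1 - 3/7`.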

### Score types and thresholds

* **Numeric evaluators** return a score in `[0.0, 1.0]` and ship with a `defaultThreshold` (e.g. `0.7`). The report flags a row as failing if the score is below the threshold.
* **Binary evaluators** return pass or fail. No threshold.
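
The pass/fail rule for a single score can be sketched as follows. The `passes` helper is hypothetical; it mirrors the rule above, where a numeric score fails only when it is below the threshold.

```python
def passes(score, default_threshold=None) -> bool:
    """Binary scores pass as-is; numeric scores pass when not below the threshold."""
    if isinstance(score, bool):
        return score
    if default_threshold is None:
        raise ValueError("numeric scores need a threshold")
    return score >= default_threshold
```

With a threshold of `0.7`, a score of exactly `0.7` passes and `0.69` fails.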

***

## Judge model resolution

LLM-as-judge and RAG evaluators need a judge LLM to run their scoring prompts. The model is resolved **per organization** via a new LLM capability:

* 5.9.0 adds the **`EVAL`** capability to the org-manager LLM catalogue (display order 7, alongside the existing capabilities).
* The platform-default judge model is **`gpt-5.4-mini`**, backfilled onto every org's catalogue on upgrade.
* An admin can change the judge model per org by assigning `EVAL` to a different model in **Settings → AI Providers → Models**.

As a result, the same experiment can produce slightly different scores in two orgs that use different judge models. The report records which model was used, so comparisons within one org stay like-for-like.
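
The per-org resolution and the upgrade backfill can be pictured with a small in-memory model. The dictionary layout and function names below are assumptions for illustration; the real lookup happens in `organization-manager`.

```python
PLATFORM_DEFAULT_JUDGE = "gpt-5.4-mini"


def backfill_eval_binding(org_bindings: dict[str, dict[str, str]]) -> None:
    """Give every org an EVAL binding unless it already has one."""
    for capabilities in org_bindings.values():
        capabilities.setdefault("EVAL", PLATFORM_DEFAULT_JUDGE)


def resolve_judge_model(org_id: str, org_bindings: dict[str, dict[str, str]]) -> str:
    """Return the model bound to EVAL for the org, or fail if none is bound."""
    try:
        return org_bindings[org_id]["EVAL"]
    except KeyError:
        raise LookupError(f"no model bound to EVAL for org {org_id!r}") from None


bindings = {"org-a": {}, "org-b": {"EVAL": "some-other-model"}}
backfill_eval_binding(bindings)
```

After the backfill, `org-a` resolves to the platform default while `org-b` keeps the model an admin assigned.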

<Info>
  The judge LLM is *not* the same as the LLM your agent uses. The agent under test can use any provider; the judge runs separately and can be a cheaper, faster model dedicated to scoring.
</Info>

***

## Permissions

Evals reuse existing workflow-level permissions — there is no new role.

| Capability                                                                                       | Permission      |
| ------------------------------------------------------------------------------------------------ | --------------- |
| List, view datasets, experiments, reports                                                        | `WORKFLOW_READ` |
| Create / update / delete datasets, add / remove test cases, launch / cancel / delete experiments | `WORKFLOW_EDIT` |

Permissions are enforced per workflow resource ID — a user with `WORKFLOW_EDIT` on Workflow A but not Workflow B cannot launch an experiment against B's nodes.
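
A sketch of the per-workflow check, assuming grants are stored as a set of permission names per workflow ID. The action names are hypothetical, and the sketch treats `WORKFLOW_READ` and `WORKFLOW_EDIT` as independent grants, which may not match how the platform relates them.

```python
REQUIRED_PERMISSION = {
    "view_report": "WORKFLOW_READ",
    "list_datasets": "WORKFLOW_READ",
    "create_dataset": "WORKFLOW_EDIT",
    "add_test_case": "WORKFLOW_EDIT",
    "launch_experiment": "WORKFLOW_EDIT",
    "cancel_experiment": "WORKFLOW_EDIT",
}


def is_allowed(grants: dict[str, set[str]], workflow_id: str, action: str) -> bool:
    """Check the permission required for the action on this specific workflow."""
    return REQUIRED_PERMISSION[action] in grants.get(workflow_id, set())


grants = {
    "workflow-a": {"WORKFLOW_READ", "WORKFLOW_EDIT"},
    "workflow-b": {"WORKFLOW_READ"},
}
```

With these grants, the user can launch an experiment on `workflow-a` but can only view reports on `workflow-b`.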

***

## How it works

The runtime side has three pieces:

* **`integration-designer`** owns the dataset, test-case, and experiment REST APIs and the experiment orchestration. It publishes judge job requests to Kafka and writes results to MongoDB.
* **`evals-judge`** is a small Python service that consumes judge job requests, resolves the judge LLM via `organization-manager`, calls the LLM provider, and publishes scored responses.
* **`organization-manager`** owns the LLM catalogue and resolves the model bound to the `EVAL` capability for the org.

Scoring is fully async. A judge job's lifecycle:

1. `integration-designer` publishes a request on `ai.flowx.ai-platform.evals-judge.job.request.v1` with the test case, the evaluator id, and the node-config snapshot.
2. `evals-judge` consumes the request, resolves the judge model, runs the scoring prompt, and publishes the result on `ai.flowx.ai-platform.evals-judge.job.response.v1`.
3. `integration-designer` consumes the response, persists the score on the test-case row, and pushes a progress event over SSE to any open Designer report.
4. Permanent failures (no judge model resolvable, invalid request shape, repeated timeouts) end up on `ai.flowx.ai-platform.evals-judge.job.dlq.v1` for manual operator review — the report shows the test-case row as `ERROR`.
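
The lifecycle above can be simulated in memory, with queues standing in for the Kafka topics. This is a toy model, not the services' code: the message fields are hypothetical, and every exception is treated as a permanent failure for simplicity.

```python
from collections import deque

REQUEST = "ai.flowx.ai-platform.evals-judge.job.request.v1"
RESPONSE = "ai.flowx.ai-platform.evals-judge.job.response.v1"
DLQ = "ai.flowx.ai-platform.evals-judge.job.dlq.v1"


def new_topics() -> dict[str, deque]:
    """One in-memory queue per topic."""
    return {REQUEST: deque(), RESPONSE: deque(), DLQ: deque()}


def judge_worker(topics, resolve_model, score) -> None:
    """Stand-in for evals-judge: score each request or dead-letter it."""
    while topics[REQUEST]:
        job = topics[REQUEST].popleft()
        try:
            model = resolve_model(job["orgId"])
            result = score(model, job)
        except Exception as exc:  # simplification: any failure is permanent
            topics[DLQ].append({**job, "error": str(exc)})
        else:
            topics[RESPONSE].append({**job, "judgeModel": model, **result})


def designer_consumer(topics, rows: dict) -> None:
    """Stand-in for integration-designer: persist scores, mark dead letters as ERROR."""
    while topics[RESPONSE]:
        response = topics[RESPONSE].popleft()
        rows[(response["testCaseId"], response["evaluatorId"])] = response["score"]
    for dead in topics[DLQ]:
        rows[(dead["testCaseId"], dead["evaluatorId"])] = "ERROR"
```

Dead-lettered jobs stay on the DLQ queue in this model, matching the manual-review behaviour described in step 4.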

***

## REST API surface

The Designer is the primary entry point, but the same endpoints are usable from your own tooling. All are scoped to a workspace and require `WORKFLOW_READ` / `WORKFLOW_EDIT` on the relevant workflow.

| Method             | Endpoint                                                                | Purpose                                             |
| ------------------ | ----------------------------------------------------------------------- | --------------------------------------------------- |
| POST               | `/api/workflows/datasets`                                               | Create a dataset on a node                          |
| GET                | `/api/workflows/datasets?nodeFlowxUuid={uuid}`                          | List datasets, optionally filtered to one node      |
| GET / PUT / DELETE | `/api/workflows/datasets/{datasetUuid}`                                 | Fetch, update name/description, or delete a dataset |
| POST               | `/api/workflows/datasets/{datasetUuid}/test-cases`                      | Add a test case (manual)                            |
| POST               | `/api/workflows/experiments/start`                                      | Launch an experiment                                |
| GET                | `/api/workflows/experiments`                                            | List experiments (paginated)                        |
| GET                | `/api/workflows/experiments/{experimentUuid}`                           | Experiment report with aggregate scores             |
| GET                | `/api/workflows/experiments/{experimentUuid}/test-cases/{testCaseUuid}` | One test-case row with all scores                   |
| POST               | `/api/workflows/experiments/{experimentUuid}/cancel`                    | Cancel a running experiment                         |
| DELETE             | `/api/workflows/experiments/{experimentUuid}`                           | Delete a completed or cancelled experiment          |
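
A minimal sketch of calling these endpoints from your own tooling with the Python standard library. The base URL, the bearer-token authentication scheme, and the helper names are assumptions; request bodies are omitted because their shape is not documented here.

```python
import json
import urllib.parse
import urllib.request


def build_request(base_url, token, method, path, body=None):
    """Build (but do not send) an authenticated request. Bearer auth is an assumption."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


def list_datasets_request(base_url, token, node_uuid):
    """GET the datasets attached to one node."""
    query = urllib.parse.urlencode({"nodeFlowxUuid": node_uuid})
    return build_request(base_url, token, "GET", f"/api/workflows/datasets?{query}")


def cancel_experiment_request(base_url, token, experiment_uuid):
    """POST to cancel a running experiment."""
    path = f"/api/workflows/experiments/{experiment_uuid}/cancel"
    return build_request(base_url, token, "POST", path)


request = list_datasets_request("https://designer.example.com/", "my-token", "abc-123")
# To send it: urllib.request.urlopen(request)
```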

***

## Operator setup

Evals add a new service and three new Kafka topics to the AI Platform deployment.

### evals-judge service

* **Language:** Python
* **Container image:** `{{flowx-ai}}/evals-judge`
* **Dependencies:** Kafka (request / response / DLQ), `organization-manager` (LLM resolution), the LLM provider configured for the `EVAL` capability, and S3-compatible storage (configured through the MinIO environment variables; needed for multimodal test cases).
* **No exposed ingress.** Communication is Kafka-only.
* **Storage:** none of its own — state lives in `integration-designer`'s MongoDB and the LLM provider.

Add `evals-judge` to your AI Platform install. See [AI Platform setup](/5.9/ai-platform/ai-platform-setup#service-architecture) for the full Python services list.

### Kafka topics

Declare the following in your Kafka provisioning (`kafka-topics.yaml.gotmpl` for k8s, or your own topic management):

| Topic                                              | Partitions | Purpose                                         |
| -------------------------------------------------- | ---------- | ----------------------------------------------- |
| `ai.flowx.ai-platform.evals-judge.job.request.v1`  | 3          | Judge job requests from `integration-designer`  |
| `ai.flowx.ai-platform.evals-judge.job.response.v1` | 3          | Scored responses back to `integration-designer` |
| `ai.flowx.ai-platform.evals-judge.job.dlq.v1`      | 3          | Dead-letter queue for unscored jobs             |

The `integration-designer` consumer group for the response topic is `integration-designer-evals-judge-response-group` with 3 consumer threads by default.
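
After provisioning, a quick sanity check can confirm the topics exist with enough partitions. The sketch below assumes you have already obtained a mapping of topic name to partition count from your own Kafka tooling; the `check_topics` helper is hypothetical, and treating the table's partition counts as minimums is an assumption.

```python
REQUIRED_TOPICS = {
    "ai.flowx.ai-platform.evals-judge.job.request.v1": 3,
    "ai.flowx.ai-platform.evals-judge.job.response.v1": 3,
    "ai.flowx.ai-platform.evals-judge.job.dlq.v1": 3,
}


def check_topics(existing: dict[str, int]) -> list[str]:
    """Compare the cluster's topics against the required set and report problems."""
    problems = []
    for topic, partitions in REQUIRED_TOPICS.items():
        if topic not in existing:
            problems.append(f"missing topic: {topic}")
        elif existing[topic] < partitions:
            problems.append(
                f"{topic}: {existing[topic]} partitions, expected at least {partitions}"
            )
    return problems


problems = check_topics({
    "ai.flowx.ai-platform.evals-judge.job.request.v1": 3,
    "ai.flowx.ai-platform.evals-judge.job.response.v1": 1,
})
```

A response topic with fewer partitions than consumer threads leaves some threads idle, because Kafka assigns each partition to at most one consumer in a group.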

### EVAL capability backfill

The migration `20260513_eval_capability.xml` runs on the first `organization-manager` boot after upgrade and:

* Registers the `EVAL` capability in `llm_capability_catalog` (display order 7).
* Backfills `gpt-5.4-mini` as the model bound to `EVAL` in `llm_model_catalog` (platform default).
* Backfills the same binding into every existing organization's `llm_model` table so evals work out-of-the-box on upgrade without manual configuration.

If your org uses a non-OpenAI provider, change the bound model in **Settings → AI Providers → Models** after the upgrade.

***

## Limitations in 5.9.0

* **No user-defined evaluators.** The 10-evaluator catalogue is read-only. Custom rubrics and custom scoring prompts are planned for a later release.
* **No eval suggestions.** The `EVAL_SUGGESTIONS` capability is registered but reserved — there is no producer in 5.9.0.
* **Workflow-scoped only.** Datasets bind to a single AI node on a single workflow. Cross-workflow shared datasets are out of scope.
* **Judge model is org-wide.** Picking a different judge per evaluator or per experiment is not supported.
* **No retroactive backfill across workflow rewrites.** If you change the AI node's contract significantly, existing datasets keep their snapshot — they don't fail, but they may no longer reflect the current workflow's behaviour. Create a new dataset.
* **DLQ is manual.** There is no automatic retry for jobs landing on the DLQ; an operator must inspect and decide whether to retry.

***

## Related resources

<CardGroup cols={2}>
  <Card title="AI Platform setup" icon="server" href="./ai-platform-setup">
    Install evals-judge alongside the rest of the AI Platform.
  </Card>

  <Card title="AI providers" icon="plug" href="./ai-providers">
    Configure the LLM bound to the `EVAL` capability for your org.
  </Card>

  <Card title="Agent Builder overview" icon="robot" href="./agent-builder/overview">
    Build the AI agents whose outputs you'll score.
  </Card>

  <Card title="Conversational workflows" icon="diagram-project" href="./conversational-workflows">
    Conversational workflow nodes you can attach datasets to.
  </Card>
</CardGroup>
