Attach test cases to AI nodes in a workflow, pick from a catalogue of evaluators (correctness, hallucination, RAG groundedness, JSON match, and more), launch an experiment, and read per-evaluator aggregate scores plus per-test-case details. Powered by the async evals-judge service.
Overview
An evaluation (eval) scores the output of an AI node against a known expected outcome. Use evals to:
- Tune prompts and retrieval: change a system prompt or KB chunk size, rerun the same experiment, and see whether your scores improved.
- Regression-guard agents: before promoting a new agent version, run the dataset of “hard” past inputs and compare scores against the previous version.
- Pick a model: run the same dataset against two different LLMs and compare correctness, conciseness, and cost.
Core concepts
| Concept | What it is |
|---|---|
| Dataset | A named collection of test cases attached to a specific AI node in a workflow. Holds the node configuration snapshot at the time the dataset was last updated. |
| Test case | One row of input data plus an expected output. May include retrieved context. Sourced either manually or captured from a real node execution in a workflow run. |
| Evaluator (scorer) | A pre-seeded rule that scores a model output against the expected output. Belongs to one of three categories — LLM-as-judge, RAG-quality, or code-exact-match. |
| Experiment | A run of the workflow against every test case in one or more datasets, scored by one or more evaluators. Produces aggregate scores per evaluator plus per-test-case details. |
| Score | Numeric (typically 0.0–1.0) or binary, depending on the evaluator. Numeric scorers have a default pass/fail threshold. |
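The relationships in the table above can be sketched as a small data model. This is only an illustration: the class and field names are hypothetical and are not the platform's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EvalTestCase:
    input: dict[str, Any]                       # data the AI node receives
    expected_output: Optional[Any] = None       # what a correct response looks like
    context: list[str] = field(default_factory=list)  # optional retrieved context


@dataclass
class Dataset:
    name: str
    node_uuid: str                              # the single AI node the dataset is attached to
    node_config_snapshot: dict[str, Any]        # node configuration at last update
    test_cases: list[EvalTestCase] = field(default_factory=list)


@dataclass
class Evaluator:
    name: str
    category: str                               # "llm-as-judge", "rag-quality", or "code"
    binary: bool = False
    default_threshold: Optional[float] = None   # numeric evaluators only


@dataclass
class Experiment:
    datasets: list[Dataset]
    evaluators: list[Evaluator]

    def scoring_pairs(self) -> int:
        """Number of (test case x evaluator) pairs the experiment will score."""
        cases = sum(len(d.test_cases) for d in self.datasets)
        return cases * len(self.evaluators)
```

An experiment over one dataset with two test cases and two evaluators therefore produces four scores.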
Get started
Add an AI node to a workflow
Drop a Custom Agent, Intent Classification, Extract Data, or any other AI node into a workflow and configure it as usual.
Create a dataset on the node
From the node’s side panel, open Evaluations → Datasets and click New dataset. Name it after what it tests (e.g. Customer-intent-classification-baseline).
Add test cases
Add at least one test case manually, or capture one from an existing process instance run (see Add test cases).
Pick evaluators
Open Evaluations → Experiments → New experiment. Choose one or more datasets and pick the evaluators you want to score against from the catalogue.
Launch the experiment
Click Start. The workflow runs against every test case; the evals-judge service scores each (test case × evaluator) pair asynchronously.
Create a dataset
A dataset belongs to a single AI node on a single workflow. Two datasets on the same node must have unique names. Fields:
| Field | Description |
|---|---|
| Name | Human-readable label, unique per node. |
| Description | Optional. Useful for explaining “what is this dataset testing?” |
| Multimodal | Toggle on if test cases include image or file inputs as well as text. |
| Output shape contract | Optional JSON schema describing the expected output shape. Used by the JSON Match evaluator. |
Add test cases
A test case has three core fields:
- Input: the data the AI node receives.
- Expected output: what a correct response looks like. Required by exact-match and JSON-match evaluators; optional but recommended for LLM-as-judge.
- Context: optional retrieved context (KB chunks, retrieval results) used to score RAG evaluators.
There are two ways to add a test case:
- Manual
- From a node execution
Manual: from the dataset’s Test cases tab, click Add test case. Fill in Input, Expected output, and optional Context. Save.
Use this for synthetic or hand-crafted cases: typical edge cases, known-difficult inputs, or cases borrowed from a spec or QA tracker.
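As a rough aid when hand-crafting cases, the sketch below checks whether a test case carries the fields a chosen evaluator relies on. The key names (expectedOutput, context) are hypothetical, the Levenshtein entry is inferred from its description, and treating context as mandatory for RAG evaluators is an assumption.

```python
# Evaluators that compare against an expected output, per this page
# (the Levenshtein entry is inferred: it compares output with expected).
NEEDS_EXPECTED = {"Correctness", "Exact match", "Levenshtein", "JSON match"}
# RAG evaluators score against retrieved context (assumed to be required here).
NEEDS_CONTEXT = {"Groundedness", "Helpfulness", "Retrieval relevance"}


def missing_fields(test_case: dict, evaluator: str) -> list[str]:
    """Return the test-case fields the evaluator needs but the case lacks."""
    missing = []
    if evaluator in NEEDS_EXPECTED and test_case.get("expectedOutput") is None:
        missing.append("expectedOutput")
    if evaluator in NEEDS_CONTEXT and not test_case.get("context"):
        missing.append("context")
    return missing


case = {
    "input": {"message": "I want to close my account"},
    "expectedOutput": "CLOSE_ACCOUNT",
}
print(missing_fields(case, "Exact match"))
print(missing_fields(case, "Groundedness"))
```

The first call reports nothing missing; the second reports that the case has no context to score against.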
Launch an experiment
An experiment binds together:
- One or more datasets (all must be on the same AI node)
- One or more evaluators
- A node configuration snapshot (frozen at launch time)
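When the experiment starts, a judge job for each (test case × evaluator) pair is published on the evals-judge.job.request Kafka topic. evals-judge resolves the configured judge LLM, runs the scoring prompt, and returns a score and reasoning on evals-judge.job.response.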
Experiment status moves through:
| Status | Meaning |
|---|---|
| STARTED | Experiment accepted, workflows are being scheduled. |
| RUNNING | At least one (test case × evaluator) pair is still scoring. |
| PASSED | All scoring jobs completed; overall score is above the pass threshold for the experiment. |
| FAILED | All scoring jobs completed; overall score is below threshold (or a non-trivial number of jobs hit the DLQ). |
| CANCELLED | Cancelled by a user before completion. |
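One way to read the status table is as a function of the scoring jobs' state. The sketch below is an assumption-laden illustration, not the platform's logic: the overall score is taken as a plain mean, a score equal to the threshold is treated as passing, and max_dlq_fraction is a hypothetical stand-in for "a non-trivial number of jobs".

```python
from enum import Enum
from typing import Optional


class ExperimentStatus(Enum):
    STARTED = "STARTED"
    RUNNING = "RUNNING"
    PASSED = "PASSED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


def derive_status(
    scores: list[Optional[float]],   # one entry per scheduled pair; None = still scoring
    dlq_count: int,                  # jobs that ended up on the DLQ
    pass_threshold: float,
    max_dlq_fraction: float = 0.1,   # hypothetical cut-off for "non-trivial"
    cancelled: bool = False,
) -> ExperimentStatus:
    if cancelled:
        return ExperimentStatus.CANCELLED
    if not scores and dlq_count == 0:
        return ExperimentStatus.STARTED      # accepted, nothing scheduled yet
    if any(s is None for s in scores):
        return ExperimentStatus.RUNNING
    total = len(scores) + dlq_count
    if not scores or dlq_count / total > max_dlq_fraction:
        return ExperimentStatus.FAILED
    overall = sum(scores) / len(scores)      # assumed: overall score is the mean
    if overall >= pass_threshold:
        return ExperimentStatus.PASSED
    return ExperimentStatus.FAILED
```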
View results
Each experiment produces a report with two views:
Aggregate scores
A row per evaluator with the mean (or pass-rate) across all test cases. Useful for “is this prompt better than the last one?” comparisons.
Per-test-case details
A row per test case showing:
- The input, the model’s actual output, and the expected output side-by-side.
- One score column per evaluator selected for the experiment.
- For LLM-as-judge evaluators, the judge’s reasoning text (the model’s own explanation of the score).
- Pass/fail badges against each evaluator’s threshold.
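The aggregate view can be reproduced from the per-test-case rows. A minimal sketch, assuming numeric evaluators are summarised by their mean and binary evaluators by their pass rate; the row shape (evaluator name, score) is hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Union


def aggregate(rows: list[tuple[str, Union[float, bool]]]) -> dict[str, dict[str, float]]:
    """Group per-test-case scores by evaluator and summarise each group."""
    by_evaluator = defaultdict(list)
    for evaluator, score in rows:
        by_evaluator[evaluator].append(score)

    report = {}
    for evaluator, scores in by_evaluator.items():
        if all(isinstance(s, bool) for s in scores):
            # Binary evaluator: share of test cases that passed.
            report[evaluator] = {"pass_rate": sum(scores) / len(scores)}
        else:
            # Numeric evaluator: mean score across test cases.
            report[evaluator] = {"mean": mean(scores)}
    return report


rows = [
    ("Correctness", 0.5),
    ("Correctness", 1.0),
    ("Exact match", True),
    ("Exact match", False),
]
print(aggregate(rows))
```

Comparing two experiments then comes down to comparing these per-evaluator numbers for the same dataset.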
Evaluator catalogue
5.9.0 ships 10 built-in evaluators in three categories. The catalogue is read-only; user-defined evaluators are planned for a later release.
LLM-as-judge
A judge LLM reads the model output (and optional expected output) and scores it on a single rubric. Score is numeric 0.0–1.0.
| Evaluator | What it scores | Needs expected output? |
|---|---|---|
| Correctness | How closely the output matches the expected meaning. | Yes |
| Conciseness | Whether the output is appropriately brief — penalises padding and rambling. | No |
| Hallucination | Whether the output contains claims not grounded in the input or provided context. | No |
| Answer relevance | Whether the output actually answers what the input asked. | No |
RAG quality
Evaluators that judge the retrieval side of a RAG workflow. Score is numeric 0.0–1.0.
| Evaluator | What it scores |
|---|---|
| Groundedness | Is every claim in the output supported by the retrieved context? |
| Helpfulness | Does the retrieved context actually help answer the question, or is it off-topic? |
| Retrieval relevance | Are the top retrieved chunks the most relevant to the question? |
Code
Deterministic, no LLM call. Useful for structured outputs.
| Evaluator | What it scores | Score type |
|---|---|---|
| Exact match | Output equals expected, character-for-character. | Binary |
| Levenshtein | Edit-distance similarity between output and expected (normalised). | Numeric |
| JSON match | Output is valid JSON and conforms to the dataset’s output-shape contract. | Binary |
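To make the three code evaluators concrete, here is a minimal sketch of each. It is not the shipped implementation: the Levenshtein normalisation (1 minus distance over the longer length) is one common choice and is assumed here, and json_match uses a simplified required-keys check as a stand-in for validating against the dataset's output-shape contract.

```python
import json


def exact_match(output: str, expected: str) -> bool:
    """Binary: output equals expected, character for character."""
    return output == expected


def levenshtein_similarity(output: str, expected: str) -> float:
    """Numeric: 1.0 for identical strings, 0.0 when nothing lines up."""
    if not output and not expected:
        return 1.0
    prev = list(range(len(expected) + 1))
    for i, a in enumerate(output, start=1):
        curr = [i]
        for j, b in enumerate(expected, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    distance = prev[-1]
    return 1.0 - distance / max(len(output), len(expected))


def json_match(output: str, required: dict[str, type]) -> bool:
    """Binary: output parses as a JSON object with the required keys and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in required.items())
```

For example, "kitten" and "sitting" are three edits apart over a longer length of seven, so the similarity under this normalisation is 1 - 3/7.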
Score types and thresholds
- Numeric evaluators return a score in [0.0, 1.0] and ship with a defaultThreshold (e.g. 0.7). The report flags a row as failing if the score is below the threshold.
- Binary evaluators return pass or fail. No threshold.
Judge model resolution
LLM-as-judge and RAG evaluators need a judge LLM to run their scoring prompts. The model is resolved per organization via a new LLM capability:
- 5.9.0 adds the EVAL capability to the org-manager LLM catalogue (display order 7, alongside the existing capabilities).
- The platform-default judge model is gpt-5.4-mini, backfilled onto every org’s catalogue on upgrade.
- An admin can change the judge model per org by assigning EVAL to a different model in Settings → AI Providers → Models.
The judge LLM is not the same as the LLM your agent uses. The agent under test can use any provider; the judge runs separately and can be a cheaper, faster model dedicated to scoring.
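Conceptually, resolution is a per-org lookup of whichever model holds the EVAL capability. The sketch below is illustrative only: the bindings structure and the model name "example-judge-model" are hypothetical, and the fallback to the platform default is an assumption (on upgrade the default is backfilled onto every org, so a real lookup may never need it).

```python
PLATFORM_DEFAULT_JUDGE = "gpt-5.4-mini"  # platform default named on this page


def resolve_judge_model(org_id: str, org_bindings: dict[str, dict[str, str]]) -> str:
    """Return the model bound to EVAL for the org, else the platform default."""
    capabilities = org_bindings.get(org_id, {})
    return capabilities.get("EVAL", PLATFORM_DEFAULT_JUDGE)


# Hypothetical bindings: org-a has reassigned EVAL, org-b has not.
bindings = {"org-a": {"EVAL": "example-judge-model"}}
print(resolve_judge_model("org-a", bindings))
print(resolve_judge_model("org-b", bindings))
```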
Permissions
Evals reuse existing workflow-level permissions; there is no new role.
| Capability | Permission |
|---|---|
| List, view datasets, experiments, reports | WORKFLOW_READ |
| Create / update / delete datasets, add / remove test cases, launch / cancel / delete experiments | WORKFLOW_EDIT |
Permissions are checked per workflow: a user with WORKFLOW_EDIT on Workflow A but not Workflow B cannot launch an experiment against B’s nodes.
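The per-workflow check can be pictured as follows. This is a sketch with hypothetical action names and a hypothetical grants structure; it also does not assume that WORKFLOW_EDIT implies WORKFLOW_READ, since this page does not say so.

```python
READ_ACTIONS = {"list", "view"}
EDIT_ACTIONS = {
    "create_dataset", "update_dataset", "delete_dataset",
    "add_test_case", "remove_test_case",
    "launch_experiment", "cancel_experiment", "delete_experiment",
}


def required_permission(action: str) -> str:
    """Map an eval action (hypothetical names) to the workflow permission it needs."""
    if action in READ_ACTIONS:
        return "WORKFLOW_READ"
    if action in EDIT_ACTIONS:
        return "WORKFLOW_EDIT"
    raise ValueError(f"unknown action: {action}")


def is_allowed(grants: dict[str, set[str]], workflow: str, action: str) -> bool:
    """Check the permission on the specific workflow the action targets."""
    return required_permission(action) in grants.get(workflow, set())


grants = {
    "workflow-a": {"WORKFLOW_READ", "WORKFLOW_EDIT"},
    "workflow-b": {"WORKFLOW_READ"},
}
print(is_allowed(grants, "workflow-a", "launch_experiment"))
print(is_allowed(grants, "workflow-b", "launch_experiment"))
```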
How it works
The runtime side has three pieces:
- integration-designer owns the dataset, test-case, and experiment REST APIs and the experiment orchestration. It publishes judge job requests to Kafka and writes results to MongoDB.
- evals-judge is a small Python service that consumes judge job requests, resolves the judge LLM via organization-manager, calls the LLM provider, and publishes scored responses.
- organization-manager owns the LLM catalogue and resolves the model bound to the EVAL capability for the org.
Each scoring job follows this path:
- integration-designer publishes a request on ai.flowx.ai-platform.evals-judge.job.request.v1 with the test case, the evaluator id, and the node-config snapshot.
- evals-judge consumes the request, resolves the judge model, runs the scoring prompt, and publishes the result on ai.flowx.ai-platform.evals-judge.job.response.v1.
- integration-designer consumes the response, persists the score on the test-case row, and pushes a progress event over SSE to any open Designer report.
- Permanent failures (no judge model resolvable, invalid request shape, repeated timeouts) end up on ai.flowx.ai-platform.evals-judge.job.dlq.v1 for manual operator review, and the report shows the test-case row as ERROR.
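The request, response, and dead-letter path can be simulated in memory to show the routing decision. This sketch uses plain queues instead of a Kafka client; the message field names and the two callbacks are hypothetical, and only the topic names come from this page.

```python
import json
from collections import deque

REQUEST_TOPIC = "ai.flowx.ai-platform.evals-judge.job.request.v1"
RESPONSE_TOPIC = "ai.flowx.ai-platform.evals-judge.job.response.v1"
DLQ_TOPIC = "ai.flowx.ai-platform.evals-judge.job.dlq.v1"

# In-memory stand-ins for the three Kafka topics.
topics = {name: deque() for name in (REQUEST_TOPIC, RESPONSE_TOPIC, DLQ_TOPIC)}


def publish_request(test_case_id, evaluator_id, node_config):
    """integration-designer side: enqueue one judge job (hypothetical field names)."""
    job = {"testCaseId": test_case_id, "evaluatorId": evaluator_id, "nodeConfig": node_config}
    topics[REQUEST_TOPIC].append(json.dumps(job))


def judge_once(resolve_model, score_fn):
    """evals-judge side: consume one job, then respond or dead-letter it."""
    raw = topics[REQUEST_TOPIC].popleft()
    try:
        job = json.loads(raw)
        model = resolve_model()
        if model is None:
            raise LookupError("no judge model resolvable")
        score, reasoning = score_fn(model, job)
        response = {
            "testCaseId": job["testCaseId"],
            "evaluatorId": job["evaluatorId"],
            "score": score,
            "reasoning": reasoning,
        }
        topics[RESPONSE_TOPIC].append(json.dumps(response))
    except (LookupError, ValueError) as err:
        # Permanent failure: park the job for manual operator review.
        topics[DLQ_TOPIC].append(json.dumps({"request": raw, "error": str(err)}))


publish_request("tc-1", "correctness", {"prompt": "..."})
judge_once(lambda: "gpt-5.4-mini", lambda model, job: (0.9, "matches expected meaning"))

publish_request("tc-2", "correctness", {})
judge_once(lambda: None, lambda model, job: (0.0, ""))  # no judge model: goes to the DLQ
```

After the two calls, one scored response sits on the response queue and one failed job on the dead-letter queue.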
REST API surface
The Designer is the primary entry point, but the same endpoints are usable from your own tooling. All are scoped to a workspace and require WORKFLOW_READ / WORKFLOW_EDIT on the relevant workflow.
| Method | Endpoint | Purpose |
|---|---|---|
| POST | /api/workflows/datasets | Create a dataset on a node |
| GET | /api/workflows/datasets?nodeFlowxUuid={uuid} | List datasets, optionally filtered to one node |
| GET / PUT / DELETE | /api/workflows/datasets/{datasetUuid} | Fetch, update name/description, or delete a dataset |
| POST | /api/workflows/datasets/{datasetUuid}/test-cases | Add a test case (manual) |
| POST | /api/workflows/experiments/start | Launch an experiment |
| GET | /api/workflows/experiments | List experiments (paginated) |
| GET | /api/workflows/experiments/{experimentUuid} | Experiment report with aggregate scores |
| GET | /api/workflows/experiments/{experimentUuid}/test-cases/{testCaseUuid} | One test-case row with all scores |
| POST | /api/workflows/experiments/{experimentUuid}/cancel | Cancel a running experiment |
| DELETE | /api/workflows/experiments/{experimentUuid} | Delete a completed or cancelled experiment |
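For scripting against these endpoints, a request builder along the following lines may be a useful starting point. Only the paths and methods come from the table above; the host, the bearer-token header, and every request-body field are hypothetical, and how the workspace scope is conveyed is not covered here.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://flowx.example.com"  # hypothetical host


def build_request(method: str, path: str, token: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a JSON request for one of the eval endpoints."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}  # assumed auth scheme
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


# Hypothetical payloads: the real request schemas are not documented on this page.
create_dataset = build_request(
    "POST", "/api/workflows/datasets", "TOKEN",
    {"name": "Customer-intent-classification-baseline", "nodeFlowxUuid": "NODE_UUID"},
)
start_experiment = build_request(
    "POST", "/api/workflows/experiments/start", "TOKEN",
    {"datasetUuids": ["DATASET_UUID"], "evaluators": ["Correctness", "Exact match"]},
)
list_experiments = build_request("GET", "/api/workflows/experiments", "TOKEN")

# To send a request: urllib.request.urlopen(start_experiment)
```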
Operator setup
Evals add a new service and three new Kafka topics to the AI Platform deployment.
evals-judge service
- Language: Python
- Container image: {{flowx-ai}}/evals-judge
- Dependencies: Kafka (request / response / DLQ), organization-manager (LLM resolution), the LLM provider configured for the EVAL capability, S3-compatible storage (MinIO env vars on multimodal test cases).
- No exposed ingress. Communication is Kafka-only.
- Storage: none of its own; state lives in integration-designer’s MongoDB and the LLM provider.
Add evals-judge to your AI Platform install. See AI Platform setup for the full Python services list.
Kafka topics
Declare the following in your Kafka provisioning (kafka-topics.yaml.gotmpl for k8s, or your own topic management):
| Topic | Partitions | Purpose |
|---|---|---|
| ai.flowx.ai-platform.evals-judge.job.request.v1 | 3 | Judge job requests from integration-designer |
| ai.flowx.ai-platform.evals-judge.job.response.v1 | 3 | Scored responses back to integration-designer |
| ai.flowx.ai-platform.evals-judge.job.dlq.v1 | 3 | Dead-letter queue for unscored jobs |
The integration-designer consumer group for the response topic is integration-designer-evals-judge-response-group, with 3 consumer threads by default.
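If you manage topics yourself rather than through the Helm template, the three topics can be created with the standard kafka-topics.sh tool. The helper below only generates the commands; the bootstrap server and the replication factor are placeholders you must choose for your cluster, since this page specifies only partition counts.

```python
# Topic names and partition counts from the table above.
TOPICS = {
    "ai.flowx.ai-platform.evals-judge.job.request.v1": 3,
    "ai.flowx.ai-platform.evals-judge.job.response.v1": 3,
    "ai.flowx.ai-platform.evals-judge.job.dlq.v1": 3,
}


def create_topic_commands(bootstrap_server: str, replication_factor: int) -> list[str]:
    """Return one kafka-topics.sh command per eval topic."""
    return [
        f"kafka-topics.sh --bootstrap-server {bootstrap_server} --create --if-not-exists "
        f"--topic {topic} --partitions {partitions} --replication-factor {replication_factor}"
        for topic, partitions in TOPICS.items()
    ]


# Placeholder values: replace with your broker address and replication policy.
for command in create_topic_commands("kafka:9092", replication_factor=1):
    print(command)
```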
EVAL capability backfill
The migration 20260513_eval_capability.xml runs on the first organization-manager boot after upgrade and:
- Registers the EVAL capability in llm_capability_catalog (display order 7).
- Backfills gpt-5.4-mini as the model bound to EVAL in llm_model_catalog (platform default).
- Backfills the same binding into every existing organization’s llm_model table so evals work out-of-the-box on upgrade without manual configuration.
Limitations in 5.9.0
- No user-defined evaluators. The 10-evaluator catalogue is read-only. Custom rubrics and custom scoring prompts are planned for a later release.
- No eval suggestions. The EVAL_SUGGESTIONS capability is registered but reserved; there is no producer in 5.9.0.
- Workflow-scoped only. Datasets bind to a single AI node on a single workflow. Cross-workflow shared datasets are out of scope.
- Judge model is org-wide. Picking a different judge per evaluator or per experiment is not supported.
- No retroactive backfill across workflow rewrites. If you change the AI node’s contract significantly, existing datasets keep their snapshot — they don’t fail, but they may no longer reflect the current workflow’s behaviour. Create a new dataset.
- DLQ is manual. There is no automatic retry for jobs landing on the DLQ; an operator must inspect and decide whether to retry.
Related resources
AI Platform setup
Install evals-judge alongside the rest of the AI Platform.
AI providers
Configure the LLM bound to the EVAL capability for your org.
Agent Builder overview
Build the AI agents whose outputs you’ll score.
Conversational workflows
Conversational workflow nodes you can attach datasets to.

