Evaluations

Attach test cases to AI nodes in a workflow, pick from a catalogue of evaluators (correctness, hallucination, RAG groundedness, JSON match, and more), launch an experiment, and read per-evaluator aggregate scores plus per-test-case details. Scoring is handled by the asynchronous evals-judge service.

Overview

An evaluation (eval) scores the output of an AI node against a known expected outcome. Use evals to:

Tune prompts and retrieval — change a system prompt or KB chunk size, rerun the same experiment, and see whether your scores improved.
Regression-guard agents — before promoting a new agent version, run the dataset of “hard” past inputs and compare scores against the previous version.
Pick a model — run the same dataset against two different LLMs and compare correctness, conciseness, and cost.

Evals are workflow-centric. Each dataset is attached to a specific AI node in a specific workflow; experiments launch that workflow against the dataset and score the node’s outputs.

Core concepts

Concept	What it is
Dataset	A named collection of test cases attached to a specific AI node in a workflow. Holds the node configuration snapshot at the time the dataset was last updated.
Test case	One row of input data plus an expected output. May include retrieved context. Sourced either manually or captured from a real node execution in a workflow run.
Evaluator (scorer)	A pre-seeded rule that scores a model output, usually against the expected output. Belongs to one of three categories: general quality (LLM-as-judge), RAG, or code / deterministic.
Experiment	A run of the workflow against every test case in one or more datasets, scored by one or more evaluators. Produces aggregate scores per evaluator plus per-test-case details.
Score	Numeric (typically 0.0–1.0) or binary, depending on the evaluator. Numeric scorers have a default pass/fail threshold.
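
To make the relationships in the table concrete, the sketch below models them as plain Python dataclasses. The class names, field names, and sample values are illustrative assumptions, not the platform's actual schema; only the relationships (a dataset holds test cases for one node, an experiment pairs datasets with evaluators) come from the table above.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Any, Optional


@dataclass
class EvalCase:
    """One test case: input plus optional expected output and context."""
    input: Any
    expected_output: Optional[Any] = None
    context: list[str] = field(default_factory=list)


@dataclass
class Dataset:
    """Named collection of test cases attached to one AI node."""
    name: str
    node_id: str  # hypothetical identifier for the AI node
    cases: list[EvalCase] = field(default_factory=list)


@dataclass
class Experiment:
    """One or more datasets scored by one or more evaluators."""
    datasets: list[Dataset]
    evaluators: list[str]

    def scoring_pairs(self) -> list[tuple[EvalCase, str]]:
        """Every (test case, evaluator) pair that needs a score."""
        cases = [case for ds in self.datasets for case in ds.cases]
        return list(product(cases, self.evaluators))


baseline = Dataset(
    name="Customer-intent-classification-baseline",
    node_id="node-1",
    cases=[
        EvalCase("Where is my card?", "CARD_STATUS"),
        EvalCase("Close my account", "ACCOUNT_CLOSURE"),
    ],
)
experiment = Experiment([baseline], ["Correctness", "Exact match"])

# 2 test cases x 2 evaluators
print(len(experiment.scoring_pairs()))
```

The pair count is worth keeping in mind when sizing a dataset, since every pair becomes a separate scoring job.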

Get started

Add an AI node to a workflow

Drop a Custom Agent, Intent Classification, Extract Data, or any other AI node into a workflow and configure it as usual.

Create a dataset on the node

From the node’s side panel, open Evaluations → Datasets and click New dataset. Name it after what it tests (e.g. Customer-intent-classification-baseline).

Add test cases

Add at least one test case manually, or capture one from an existing process instance run (see Add test cases).

Pick evaluators

Open Evaluations → Experiments → New experiment. Choose one or more datasets and pick the evaluators you want to score against from the catalogue.

Launch the experiment

Click Start. The workflow runs against every test case; the evals-judge service scores each (test case × evaluator) pair asynchronously.

Read the results

Once the experiment moves to PASSED or FAILED, open the report for aggregate scores per evaluator and a list of test-case rows you can drill into.

Create a dataset

A dataset belongs to a single AI node on a single workflow. Two datasets on the same node must have unique names. Fields:

Field	Description
Name	Human-readable label, unique per node.
Description	Optional. Useful for explaining “what is this dataset testing?”
Multimodal	Toggle on if test cases include image or file inputs as well as text.
Output shape contract	Optional JSON schema describing the expected output shape. Used by the JSON Match evaluator.

The dataset persists a schema snapshot of the node’s input/output at the time of last edit. This is what’s used when a workflow is changed underneath an existing dataset — the experiment runs against the snapshot, not the live node, so older runs remain reproducible.

Add test cases

A test case has three core fields:

Input — the data the AI node receives.
Expected output — what a correct response looks like. Required by exact-match and JSON-match evaluators; optional but recommended for LLM-as-judge.
Context — optional retrieved context (KB chunks, retrieval results) used to score RAG evaluators.

Two ways to add them:

Manual

From the dataset’s Test cases tab, click Add test case. Fill in Input, Expected output, and optional Context. Save.

Use this for synthetic or hand-crafted cases: typical edge cases, known-difficult inputs, or cases borrowed from a spec or QA tracker.

From a node execution

Open a real process instance in the runtime, find the AI node execution you want to capture, and choose Add to dataset. FlowX serialises the node’s input, captured output, and any retrieved context into a new test case on the chosen dataset.

Use this to harvest “interesting” production cases (high-uncertainty completions, ones flagged by an operator) into a regression set.

Captured cases are marked with source: NODE_EXECUTION and carry a back-reference to the originating execution.

Launch an experiment

An experiment binds together:

One or more datasets (all must be on the same AI node)
One or more evaluators
A node configuration snapshot (frozen at launch time)

When you click Start, the workflow runs end-to-end against every test case. Each output produces one judge job per (test case × evaluator) pair, published to the ai.flowx.ai-platform.evals-judge.job.request.v1 Kafka topic. evals-judge resolves the configured judge LLM, runs the scoring prompt, and returns a score and its reasoning on ai.flowx.ai-platform.evals-judge.job.response.v1. Experiment status moves through:

Status	Meaning
`STARTED`	Experiment accepted, workflows are being scheduled.
`RUNNING`	At least one (test case × evaluator) pair is still scoring.
`PASSED`	All scoring jobs completed; overall score is above the pass threshold for the experiment.
`FAILED`	All scoring jobs completed; overall score is below threshold (or a non-trivial number of jobs hit the DLQ).
`CANCELLED`	Cancelled by a user before completion.

The Designer streams progress updates as scoring jobs complete, so the report fills in live without a full reload.
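
If you drive experiments from your own tooling rather than the Designer, the status table suggests a simple polling loop that stops on a terminal status. The sketch below is one way to write it. The `fetch_status` callable is a hypothetical hook (for example, a wrapper that reads the experiment report); the status names are the ones listed above.

```python
import time
from typing import Callable

# Terminal statuses taken from the status table.
TERMINAL_STATUSES = {"PASSED", "FAILED", "CANCELLED"}


def wait_for_experiment(
    fetch_status: Callable[[], str],
    poll_seconds: float = 2.0,
    max_polls: int = 100,
) -> str:
    """Poll until the experiment reaches a terminal status.

    fetch_status is a hypothetical callable that returns the current
    status string for one experiment.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("experiment did not finish within the polling budget")


# Usage with a stubbed status sequence instead of a real API call.
statuses = iter(["STARTED", "RUNNING", "RUNNING", "PASSED"])
final_status = wait_for_experiment(lambda: next(statuses), poll_seconds=0.0)
print(final_status)
```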

Multiple datasets

Available starting with FlowX.AI 5.9.2

An experiment can run against multiple datasets at once, provided they all belong to the same AI node. Overall metrics are aggregated across all of them, and the report adds a dataset column so you can see how each case scored per dataset.

View results

Each experiment produces a report with two views:

Aggregate scores

A row per evaluator with the mean (or pass-rate) across all test cases. Useful for “is this prompt better than the last one?” comparisons.
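
As a rough illustration of the two aggregates, the sketch below computes a mean for a numeric evaluator and a pass-rate for a binary one. Which aggregate the report applies to which evaluator is an assumption here, and the sample scores are made up.

```python
from statistics import mean


def mean_score(scores: list[float]) -> float:
    """Aggregate for a numeric evaluator: arithmetic mean of row scores."""
    if not scores:
        raise ValueError("no scores to aggregate")
    return mean(scores)


def pass_rate(results: list[bool]) -> float:
    """Aggregate for a binary evaluator: share of rows that passed."""
    if not results:
        raise ValueError("no results to aggregate")
    return sum(results) / len(results)


# Hypothetical per-test-case results for one experiment.
correctness_scores = [1.0, 0.5, 0.75]
exact_match_results = [True, False, True, True]

report = {
    "Correctness": mean_score(correctness_scores),
    "Exact match": pass_rate(exact_match_results),
}
print(report)
```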

Per-test-case details

A row per test case showing:

The input, the model’s actual output, and the expected output side-by-side.
One score column per evaluator selected for the experiment.
For LLM-as-judge evaluators, the judge’s reasoning text (the model’s own explanation of the score).
Pass/fail badges against each evaluator’s threshold.

Click into a row to see the full prompt, response, and any retrieved context — this is the screen you use to debug “why is this case failing?”

Improvement suggestions

Available starting with FlowX.AI 5.9.2

Experiment reports show AI-generated suggestions for improving the node’s prompt, grouped per LLM-as-judge evaluator, beneath the score cards — turning a low score into a concrete next edit.

Evaluator catalogue

The catalogue includes 10 built-in evaluators in three categories. It is read-only — user-defined evaluators are planned for a later release.

General quality

A judge LLM reads the model output (and optional expected output) and scores it on a single rubric. Score is numeric 0.0–1.0. In the Designer these appear under the General Quality group.

Evaluator	What it scores	Needs expected output?
Correctness	How closely the output matches the expected meaning.	Yes
Conciseness	Whether the output is appropriately brief — penalises padding and rambling.	No
Hallucination	Whether the output contains claims not grounded in the input or provided context.	No
Answer relevance	Whether the output actually answers what the input asked.	No

RAG

Evaluators for RAG workflows, covering both the retrieval and the answer. Score is numeric 0.0–1.0.

Evaluator	What it scores
Groundedness	Is every claim in the output traceable to the provided context, rather than the model’s prior knowledge?
Helpfulness	Holistically, how useful is the answer to the end user, considering tone, completeness, and actionability?
Retrieval relevance	How relevant are the retrieved documents to the user’s question? A quality check for the retrieval pipeline.

Code / deterministic

Deterministic, no LLM call. Useful for structured outputs.

Evaluator	What it scores	Score type
Exact match	Output equals expected, character-for-character.	Binary
Levenshtein	Edit-distance similarity between output and expected (normalised).	Numeric
JSON match	Output is valid JSON and conforms to the dataset’s output-shape contract.	Binary
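
The deterministic evaluators are simple enough to sketch in a few lines. The code below is an illustration, not the platform's implementation: the Levenshtein normalisation (1 minus distance over the longer length) is one common choice and an assumption here, and `json_shape_match` is a simplified stand-in that only checks for required top-level keys rather than validating against a full JSON schema.

```python
import json


def exact_match(output: str, expected: str) -> bool:
    """Binary: output equals expected, character for character."""
    return output == expected


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, two rows at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(output: str, expected: str) -> float:
    """Numeric: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(output, expected) / longest


def json_shape_match(output: str, required_keys: set[str]) -> bool:
    """Binary: output parses as a JSON object containing the required keys.

    Simplified stand-in for validation against an output shape contract.
    """
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


print(exact_match("CARD_STATUS", "CARD_STATUS"))
print(levenshtein_distance("kitten", "sitting"))
print(json_shape_match('{"intent": "CARD_STATUS"}', {"intent"}))
```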

Score types and thresholds

Numeric evaluators return a score in [0.0, 1.0] and ship with a defaultThreshold (e.g. 0.7). The report flags a row as failing if the score is below the threshold.
Binary evaluators return pass or fail. No threshold.
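
The pass/fail rule for a single row can be sketched as follows. The 0.7 default mirrors the example value above; actual defaults may differ per evaluator, and the function name is hypothetical.

```python
from typing import Union

EXAMPLE_THRESHOLD = 0.7  # example value from the text, not a guaranteed default


def row_passes(score: Union[float, bool], threshold: float = EXAMPLE_THRESHOLD) -> bool:
    """Return True when a row passes for one evaluator.

    Binary evaluators already return pass or fail, so no threshold applies.
    Numeric evaluators fail only when the score is below the threshold.
    """
    if isinstance(score, bool):
        return score
    return score >= threshold


print(row_passes(0.82))
print(row_passes(0.55))
print(row_passes(False))
```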

Judge model resolution

LLM-as-judge and RAG evaluators need a judge LLM to run their scoring prompts. The model is resolved per organization via a new LLM capability:

The EVAL capability is part of the organization-manager LLM catalogue (display order 7, alongside the existing capabilities).
The platform-default judge model is gpt-5.4-mini, backfilled onto every org’s catalogue on upgrade.
An admin can change the judge model per org by assigning EVAL to a different model in Settings → AI Providers → Models.

This means the same experiment can produce slightly different scores in two different orgs if they’ve picked different judge models — the report records which model was used, so the comparison stays apples-to-apples within one org.

The judge LLM is not the same as the LLM your agent uses. The agent under test can use any provider; the judge runs separately and can be a cheaper, faster model dedicated to scoring.
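
The resolution rules above can be pictured with a small in-memory model. This is a sketch under assumptions: the dictionary layout, the org ids, and the alternative model name are hypothetical, and the real lookup is performed by organization-manager rather than by code like this.

```python
PLATFORM_DEFAULT_JUDGE = "gpt-5.4-mini"  # platform default named in the text


class JudgeModelNotResolvable(LookupError):
    """Raised when an org has no model bound to the EVAL capability."""


def backfill_eval(org_models: dict[str, dict[str, str]]) -> None:
    """Bind the platform default to EVAL for every org that lacks a binding."""
    for bindings in org_models.values():
        bindings.setdefault("EVAL", PLATFORM_DEFAULT_JUDGE)


def resolve_judge_model(org_models: dict[str, dict[str, str]], org_id: str) -> str:
    """Return the model bound to EVAL for the given org."""
    try:
        return org_models[org_id]["EVAL"]
    except KeyError:
        raise JudgeModelNotResolvable(org_id) from None


# Hypothetical org ids and an illustrative alternative model name.
org_models = {"org-a": {}, "org-b": {}}
backfill_eval(org_models)
org_models["org-b"]["EVAL"] = "example-judge-model"  # an admin reassigns EVAL

print(resolve_judge_model(org_models, "org-a"))
print(resolve_judge_model(org_models, "org-b"))
```

The two orgs now score with different judge models, which is why scores are only directly comparable within one org.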

Permissions

Evals reuse existing workflow-level permissions — there is no new role.

Capability	Permission
List, view datasets, experiments, reports	`WORKFLOW_READ`
Create / update / delete datasets, add / remove test cases, launch / cancel / delete experiments	`WORKFLOW_EDIT`

Permissions are enforced per workflow resource ID — a user with WORKFLOW_EDIT on Workflow A but not Workflow B cannot launch an experiment against B’s nodes.
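
A minimal sketch of the per-workflow check, assuming a hypothetical grants map from workflow id to permission names. The action names are shorthand for the rows in the table above, and the sketch does not assume that WORKFLOW_EDIT implies WORKFLOW_READ.

```python
# Action groups taken from the permissions table.
READ_ACTIONS = {"list", "view"}
EDIT_ACTIONS = {
    "create", "update", "delete",
    "add_test_case", "remove_test_case",
    "launch", "cancel",
}


def required_permission(action: str) -> str:
    """Map an evals action to the workflow permission it needs."""
    if action in READ_ACTIONS:
        return "WORKFLOW_READ"
    if action in EDIT_ACTIONS:
        return "WORKFLOW_EDIT"
    raise ValueError(f"unknown action: {action}")


def is_allowed(grants: dict[str, set[str]], workflow_id: str, action: str) -> bool:
    """Check the permission on the specific workflow, not globally."""
    return required_permission(action) in grants.get(workflow_id, set())


# Hypothetical user: can edit Workflow A, can only read Workflow B.
grants = {
    "workflow-a": {"WORKFLOW_READ", "WORKFLOW_EDIT"},
    "workflow-b": {"WORKFLOW_READ"},
}
print(is_allowed(grants, "workflow-a", "launch"))
print(is_allowed(grants, "workflow-b", "launch"))
```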

How it works

The runtime side has three pieces:

integration-designer owns the dataset, test-case, and experiment REST APIs and the experiment orchestration. It publishes judge job requests to Kafka and writes results to MongoDB.
evals-judge is a small Python service that consumes judge job requests, resolves the judge LLM via organization-manager, calls the LLM provider, and publishes scored responses.
organization-manager owns the LLM catalogue and resolves the model bound to the EVAL capability for the org.

Scoring is fully async. A judge job’s lifecycle:

integration-designer publishes a request on ai.flowx.ai-platform.evals-judge.job.request.v1 with the test case, the evaluator id, and the node-config snapshot.
evals-judge consumes the request, resolves the judge model, runs the scoring prompt, and publishes the result on ai.flowx.ai-platform.evals-judge.job.response.v1.
integration-designer consumes the response, persists the score on the test-case row, and pushes a progress event over SSE to any open Designer report.
Permanent failures (no judge model resolvable, invalid request shape, repeated timeouts) end up on ai.flowx.ai-platform.evals-judge.job.dlq.v1 for manual operator review — the report shows the test-case row as ERROR.
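
The lifecycle above can be simulated end to end with in-memory queues standing in for the three Kafka topics. This is a simplified sketch, not the services' code: the message fields are assumed, the scoring is a placeholder equality check instead of an LLM call, and how the report learns about dead-lettered jobs is simplified to a direct read of the DLQ.

```python
from collections import deque
from typing import Optional

REQUEST = "ai.flowx.ai-platform.evals-judge.job.request.v1"
RESPONSE = "ai.flowx.ai-platform.evals-judge.job.response.v1"
DLQ = "ai.flowx.ai-platform.evals-judge.job.dlq.v1"

# In-memory stand-ins for the three Kafka topics.
topics: dict[str, deque] = {REQUEST: deque(), RESPONSE: deque(), DLQ: deque()}


def designer_publish(test_case_id: str, evaluator_id: str,
                     output: str, expected: str) -> None:
    """Step 1: integration-designer publishes one judge job request."""
    topics[REQUEST].append({
        "testCaseId": test_case_id,    # assumed field names
        "evaluatorId": evaluator_id,
        "output": output,
        "expected": expected,
    })


def judge_consume(judge_model: Optional[str]) -> None:
    """Step 2: evals-judge scores each request or dead-letters it."""
    while topics[REQUEST]:
        job = topics[REQUEST].popleft()
        if judge_model is None:
            # Permanent failure: no judge model resolvable.
            topics[DLQ].append(job)
            continue
        # Placeholder scoring; the real service runs a scoring prompt.
        score = 1.0 if job["output"] == job["expected"] else 0.0
        topics[RESPONSE].append({**job, "score": score, "judgeModel": judge_model})


def designer_collect(results: dict) -> None:
    """Steps 3 and 4: persist scores, mark dead-lettered rows as ERROR."""
    while topics[RESPONSE]:
        response = topics[RESPONSE].popleft()
        results[(response["testCaseId"], response["evaluatorId"])] = response["score"]
    for job in topics[DLQ]:
        results[(job["testCaseId"], job["evaluatorId"])] = "ERROR"


results: dict = {}
designer_publish("tc-1", "correctness", "Paris", "Paris")
judge_consume("gpt-5.4-mini")
designer_publish("tc-2", "correctness", "Rome", "Milan")
judge_consume(None)  # simulate an org with no resolvable judge model
designer_collect(results)
print(results)
```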

REST API surface

The Designer is the primary entry point, but the same endpoints are usable from your own tooling. All are scoped to a workspace and require WORKFLOW_READ / WORKFLOW_EDIT on the relevant workflow.

Method	Endpoint	Purpose
POST	`/api/workflows/datasets`	Create a dataset on a node
GET	`/api/workflows/datasets?nodeFlowxUuid={uuid}`	List datasets, optionally filtered to one node
GET / PUT / DELETE	`/api/workflows/datasets/{datasetUuid}`	Fetch, update name/description, or delete a dataset
POST	`/api/workflows/datasets/{datasetUuid}/test-cases`	Add a test case (manual)
POST	`/api/workflows/experiments/start`	Launch an experiment
GET	`/api/workflows/experiments`	List experiments (paginated)
GET	`/api/workflows/experiments/{experimentUuid}`	Experiment report with aggregate scores
GET	`/api/workflows/experiments/{experimentUuid}/test-cases/{testCaseUuid}`	One test-case row with all scores
POST	`/api/workflows/experiments/{experimentUuid}/cancel`	Cancel a running experiment
DELETE	`/api/workflows/experiments/{experimentUuid}`	Delete a completed or cancelled experiment
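
As an illustration of calling these endpoints from your own tooling, the sketch below builds (but does not send) a request to launch an experiment using Python's standard library. Only the path and method come from the table; the host, the bearer-token header, and the JSON field names (`datasetUuids`, `evaluatorIds`) are assumptions, so check the actual request schema before relying on them.

```python
import json
from urllib.request import Request

BASE_URL = "https://flowx.example.com"  # hypothetical host


def build_start_experiment_request(
    dataset_uuids: list[str],
    evaluator_ids: list[str],
    token: str,
) -> Request:
    """Build a POST request for /api/workflows/experiments/start."""
    body = {
        "datasetUuids": dataset_uuids,  # assumed field name
        "evaluatorIds": evaluator_ids,  # assumed field name
    }
    return Request(
        url=f"{BASE_URL}/api/workflows/experiments/start",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


request = build_start_experiment_request(
    ["dataset-uuid-1"], ["correctness", "exact-match"], token="example-token"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the caller needs WORKFLOW_EDIT on the workflow that owns the datasets.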

Operator setup

Evals add a new service and three new Kafka topics to the AI Platform deployment.

evals-judge service

Language: Python
Container image: {{flowx-ai}}/evals-judge
Dependencies: Kafka (request / response / DLQ), organization-manager (LLM resolution), the LLM provider configured for the EVAL capability, and S3-compatible storage (configured through the MinIO env vars, used for multimodal test cases).
No exposed ingress. Communication is Kafka-only.
Storage: none of its own — state lives in integration-designer’s MongoDB and the LLM provider.

Add evals-judge to your AI Platform install. See AI Platform setup for the full Python services list.

Kafka topics

Declare the following in your Kafka provisioning (kafka-topics.yaml.gotmpl for k8s, or your own topic management):

Topic	Partitions	Purpose
`ai.flowx.ai-platform.evals-judge.job.request.v1`	3	Judge job requests from `integration-designer`
`ai.flowx.ai-platform.evals-judge.job.response.v1`	3	Scored responses back to `integration-designer`
`ai.flowx.ai-platform.evals-judge.job.dlq.v1`	3	Dead-letter queue for unscored jobs

The integration-designer consumer group for the response topic is integration-designer-evals-judge-response-group with 3 consumer threads by default.

EVAL capability backfill

The migration 20260513_eval_capability.xml runs on the first organization-manager boot after upgrade and:

Registers the EVAL capability in llm_capability_catalog (display order 7).
Backfills gpt-5.4-mini as the model bound to EVAL in llm_model_catalog (platform default).
Backfills the same binding into every existing organization’s llm_model table so evals work out-of-the-box on upgrade without manual configuration.

If your org uses a non-OpenAI provider, change the bound model in Settings → AI Providers → Models after the upgrade.

Limitations

No user-defined evaluators. The 10-evaluator catalogue is read-only. Custom rubrics and custom scoring prompts are planned for a later release.
Workflow-scoped only. Datasets bind to a single AI node on a single workflow. Cross-workflow shared datasets are out of scope.
Judge model is org-wide. Picking a different judge per evaluator or per experiment is not supported.
No retroactive backfill across workflow rewrites. If you change the AI node’s contract significantly, existing datasets keep their snapshot — they don’t fail, but they may no longer reflect the current workflow’s behaviour. Create a new dataset.
DLQ is manual. There is no automatic retry for jobs landing on the DLQ; an operator must inspect and decide whether to retry.

Related resources

AI Platform setup

Install evals-judge alongside the rest of the AI Platform.

AI providers

Configure the LLM bound to the EVAL capability for your org.

Agent Builder overview

Build the AI agents whose outputs you’ll score.

Chat-driven workflows

Chat-driven workflow nodes you can attach datasets to.
