Attach test cases to AI nodes in a workflow, pick from a catalogue of evaluators (correctness, hallucination, RAG groundedness, JSON match, and more), launch an experiment, and read per-evaluator aggregate scores plus per-test-case details. Powered by the async evals-judge service.

Overview

An evaluation (eval) scores the output of an AI node against a known expected outcome. Use evals to:
  • Tune prompts and retrieval — change a system prompt or KB chunk size, rerun the same experiment, and see whether your scores improved.
  • Regression-guard agents — before promoting a new agent version, run the dataset of “hard” past inputs and compare scores against the previous version.
  • Pick a model — run the same dataset against two different LLMs and compare correctness, conciseness, and cost.
Evals are workflow-centric in 5.9.0. Each dataset is attached to a specific AI node in a specific workflow; experiments launch that workflow against the dataset and score the node’s outputs.

Core concepts

| Concept | What it is |
| --- | --- |
| Dataset | A named collection of test cases attached to a specific AI node in a workflow. Holds the node configuration snapshot at the time the dataset was last updated. |
| Test case | One row of input data plus an expected output. May include retrieved context. Sourced either manually or captured from a real node execution in a workflow run. |
| Evaluator (scorer) | A pre-seeded rule that scores a model output against the expected output. Belongs to one of three categories: LLM-as-judge, RAG-quality, or code-exact-match. |
| Experiment | A run of the workflow against every test case in one or more datasets, scored by one or more evaluators. Produces aggregate scores per evaluator plus per-test-case details. |
| Score | Numeric (typically 0.0–1.0) or binary, depending on the evaluator. Numeric scorers have a default pass/fail threshold. |

Get started

  1. Add an AI node to a workflow. Drop a Custom Agent, Intent Classification, Extract Data, or any other AI node into a workflow and configure it as usual.
  2. Create a dataset on the node. From the node’s side panel, open Evaluations → Datasets and click New dataset. Name it after what it tests (e.g. Customer-intent-classification-baseline).
  3. Add test cases. Add at least one test case manually, or capture one from an existing process instance run (see Add test cases).
  4. Pick evaluators. Open Evaluations → Experiments → New experiment. Choose one or more datasets and pick the evaluators you want to score against from the catalogue.
  5. Launch the experiment. Click Start. The workflow runs against every test case; the evals-judge service scores each (test case × evaluator) pair asynchronously.
  6. Read the results. Once the experiment moves to PASSED or FAILED, open the report for aggregate scores per evaluator and a list of test-case rows you can drill into.

Create a dataset

A dataset belongs to a single AI node on a single workflow. Two datasets on the same node must have unique names. Fields:
| Field | Description |
| --- | --- |
| Name | Human-readable label, unique per node. |
| Description | Optional. Useful for explaining “what is this dataset testing?” |
| Multimodal | Toggle on if test cases include image or file inputs as well as text. |
| Output shape contract | Optional JSON schema describing the expected output shape. Used by the JSON Match evaluator. |
The dataset persists a schema snapshot of the node’s input/output at the time of last edit. This is what’s used when a workflow is changed underneath an existing dataset — the experiment runs against the snapshot, not the live node, so older runs remain reproducible.

Add test cases

A test case has three core fields:
  • Input — the data the AI node receives.
  • Expected output — what a correct response looks like. Required by exact-match and JSON-match evaluators; optional but recommended for LLM-as-judge.
  • Context — optional retrieved context (KB chunks, retrieval results) used to score RAG evaluators.
There are two ways to add them: manually, or by capturing a real node execution from a workflow run.
To add one manually, open the dataset’s Test cases tab and click Add test case. Fill in Input, Expected output, and optional Context, then save. Use this for synthetic or hand-crafted cases: typical edge cases, known-difficult inputs, or cases borrowed from a spec or QA tracker.
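As a rough illustration of the three fields, the sketch below models a test case in memory and reports which selected evaluators could not score it because no expected output was supplied. The class name, field names, and evaluator identifier strings are hypothetical and not the platform's real schema; the set of evaluators that need an expected output is assembled from the field list above and the evaluator catalogue.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Evaluators that need an expected output. Exact match and JSON match come
# from the field list; Levenshtein and Correctness from the catalogue.
# The identifier strings are illustrative assumptions.
NEEDS_EXPECTED = {"exact_match", "json_match", "levenshtein", "correctness"}


@dataclass
class EvalTestCase:
    """Hypothetical in-memory shape of one test case."""
    input: Any
    expected_output: Optional[Any] = None
    context: Optional[list] = None  # retrieved KB chunks, if any


def missing_expected(case: EvalTestCase, evaluators: list) -> list:
    """Return the evaluators that cannot score this case because it has
    no expected output."""
    if case.expected_output is not None:
        return []
    return [name for name in evaluators if name in NEEDS_EXPECTED]


case = EvalTestCase(input="Where is my card?")
print(missing_expected(case, ["exact_match", "hallucination"]))
```

A check like this is cheap to run before launching an experiment, so gaps in a dataset show up before any judge jobs are scheduled.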

Launch an experiment

An experiment binds together:
  • One or more datasets (all must be on the same AI node)
  • One or more evaluators
  • A node configuration snapshot (frozen at launch time)
When you click Start, the workflow runs end-to-end against every test case. Each output produces one judge job per (test case × evaluator) pair, published to the ai.flowx.ai-platform.evals-judge.job.request.v1 Kafka topic. evals-judge resolves the configured judge LLM, runs the scoring prompt, and returns a score plus reasoning on ai.flowx.ai-platform.evals-judge.job.response.v1.
Experiment status moves through:
| Status | Meaning |
| --- | --- |
| STARTED | Experiment accepted, workflows are being scheduled. |
| RUNNING | At least one (test case × evaluator) pair is still scoring. |
| PASSED | All scoring jobs completed; overall score is above the pass threshold for the experiment. |
| FAILED | All scoring jobs completed; overall score is below threshold (or a non-trivial number of jobs hit the DLQ). |
| CANCELLED | Cancelled by a user before completion. |
The Designer streams progress updates as scoring jobs complete, so the report fills in live without a full reload.
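The fan-out of one judge job per (test case × evaluator) pair can be sketched as a Cartesian product. This is only an illustration of the counting; the function name and the job field names are assumptions, not the real message schema.

```python
from itertools import product


def plan_judge_jobs(test_case_ids, evaluator_ids):
    """Build one judge job per (test case, evaluator) pair."""
    return [
        {"testCaseId": tc, "evaluatorId": ev}  # hypothetical field names
        for tc, ev in product(test_case_ids, evaluator_ids)
    ]


jobs = plan_judge_jobs(["tc-1", "tc-2", "tc-3"], ["correctness", "hallucination"])
print(len(jobs))  # 3 test cases x 2 evaluators = 6 jobs
```

The job count grows multiplicatively, which is worth keeping in mind when selecting many LLM-as-judge evaluators for a large dataset.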

View results

Each experiment produces a report with two views:

Aggregate scores

A row per evaluator with the mean (or pass-rate) across all test cases. Useful for “is this prompt better than the last one?” comparisons.
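A minimal sketch of how a per-evaluator aggregate could be computed from individual scores, assuming numeric evaluators report floats and binary evaluators report booleans. The function name and the row shape are hypothetical; the platform's actual aggregation may differ.

```python
from collections import defaultdict
from statistics import mean


def aggregate(rows):
    """rows: iterable of (evaluator, score) pairs. A score is a float in
    [0.0, 1.0] for numeric evaluators or a bool for binary ones."""
    by_evaluator = defaultdict(list)
    for evaluator, score in rows:
        by_evaluator[evaluator].append(float(score))
    # The mean of 0/1 values is the pass rate; the mean of floats is the
    # average score.
    return {ev: mean(scores) for ev, scores in by_evaluator.items()}


report = aggregate([
    ("correctness", 0.8),
    ("correctness", 0.6),
    ("exact_match", True),
    ("exact_match", False),
])
print(report)
```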

Per-test-case details

A row per test case showing:
  • The input, the model’s actual output, and the expected output side-by-side.
  • One score column per evaluator selected for the experiment.
  • For LLM-as-judge evaluators, the judge’s reasoning text (the model’s own explanation of the score).
  • Pass/fail badges against each evaluator’s threshold.
Click into a row to see the full prompt, response, and any retrieved context — this is the screen you use to debug “why is this case failing?”

Evaluator catalogue

5.9.0 ships 10 built-in evaluators in three categories. The catalogue is read-only — user-defined evaluators are planned for a later release.

LLM-as-judge

A judge LLM reads the model output (and optional expected output) and scores it on a single rubric. Score is numeric 0.0–1.0.
| Evaluator | What it scores | Needs expected output? |
| --- | --- | --- |
| Correctness | How closely the output matches the expected meaning. | Yes |
| Conciseness | Whether the output is appropriately brief; penalises padding and rambling. | No |
| Hallucination | Whether the output contains claims not grounded in the input or provided context. | No |
| Answer relevance | Whether the output actually answers what the input asked. | No |

RAG quality

Evaluators that judge the retrieval side of a RAG workflow. Score is numeric 0.0–1.0.
| Evaluator | What it scores |
| --- | --- |
| Groundedness | Is every claim in the output supported by the retrieved context? |
| Helpfulness | Does the retrieved context actually help answer the question, or is it off-topic? |
| Retrieval relevance | Are the top retrieved chunks the most relevant to the question? |

Code

Deterministic, no LLM call. Useful for structured outputs.
| Evaluator | What it scores | Score type |
| --- | --- | --- |
| Exact match | Output equals expected, character-for-character. | Binary |
| Levenshtein | Edit-distance similarity between output and expected (normalised). | Numeric |
| JSON match | Output is valid JSON and conforms to the dataset’s output-shape contract. | Binary |
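To make the three deterministic evaluators concrete, here is a minimal sketch of each. It is not the platform's implementation. The Levenshtein normalisation used here (one minus distance divided by the longer length) is one common choice and an assumption, and the JSON check is simplified to "a JSON object containing a set of required keys" rather than full JSON schema validation.

```python
import json


def exact_match(output: str, expected: str) -> bool:
    """True only if the two strings are identical character for character."""
    return output == expected


def levenshtein_similarity(output: str, expected: str) -> float:
    """Edit-distance similarity in [0.0, 1.0]; 1.0 means identical."""
    if not output and not expected:
        return 1.0
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(expected) + 1))
    for i, a in enumerate(output, start=1):
        curr = [i]
        for j, b in enumerate(expected, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(output), len(expected))


def json_match(output: str, required_keys: set) -> bool:
    """Simplified stand-in for a shape contract: the output must parse as a
    JSON object and contain every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


print(exact_match("approved", "approved"))
print(round(levenshtein_similarity("kitten", "sitting"), 3))
print(json_match('{"intent": "card_lost"}', {"intent"}))
```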

Score types and thresholds

  • Numeric evaluators return a score in [0.0, 1.0] and ship with a defaultThreshold (e.g. 0.7). The report flags a row as failing if the score is below the threshold.
  • Binary evaluators return pass or fail. No threshold.
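The pass/fail rule for the two score types can be sketched as follows. The function name is hypothetical; the 0.7 value mirrors the example threshold above, and a score exactly equal to the threshold is treated as passing because only scores below it are flagged.

```python
def passes(score, threshold=None):
    """Binary scores pass or fail directly; numeric scores fail only when
    they fall below the evaluator's threshold."""
    if isinstance(score, bool):
        return score
    if threshold is None:
        raise ValueError("numeric scores need a threshold")
    return score >= threshold


print(passes(0.82, threshold=0.7))
print(passes(0.65, threshold=0.7))
print(passes(False))
```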

Judge model resolution

LLM-as-judge and RAG evaluators need a judge LLM to run their scoring prompts. The model is resolved per organization via a new LLM capability:
  • 5.9.0 adds the EVAL capability to the org-manager LLM catalogue (display order 7, alongside the existing capabilities).
  • The platform-default judge model is gpt-5.4-mini, backfilled onto every org’s catalogue on upgrade.
  • An admin can change the judge model per org by assigning EVAL to a different model in Settings → AI Providers → Models.
This means the same experiment can produce slightly different scores in two different orgs if they’ve picked different judge models — the report records which model was used, so the comparison stays apples-to-apples within one org.
The judge LLM is not the same as the LLM your agent uses. The agent under test can use any provider; the judge runs separately and can be a cheaper, faster model dedicated to scoring.
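A minimal sketch of per-organization judge resolution, assuming a simple in-memory mapping from organization to capability bindings. The data layout and function names are hypothetical; only the EVAL capability name and the gpt-5.4-mini default come from the text above.

```python
PLATFORM_DEFAULT_JUDGE = "gpt-5.4-mini"


def backfill_eval(org_models: dict) -> None:
    """On upgrade, bind EVAL to the platform default for every org that has
    no EVAL binding yet."""
    for bindings in org_models.values():
        bindings.setdefault("EVAL", PLATFORM_DEFAULT_JUDGE)


def resolve_judge_model(org_models: dict, org_id: str) -> str:
    """Return the model bound to EVAL for the org, or fail loudly."""
    try:
        return org_models[org_id]["EVAL"]
    except KeyError:
        raise LookupError(f"no judge model bound to EVAL for org {org_id}") from None


orgs = {"org-a": {}, "org-b": {"EVAL": "some-other-model"}}
backfill_eval(orgs)
print(resolve_judge_model(orgs, "org-a"))
print(resolve_judge_model(orgs, "org-b"))
```

Because two orgs can resolve to different judges, scores are best compared between experiments run within the same org.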

Permissions

Evals reuse existing workflow-level permissions — there is no new role.
| Capability | Permission |
| --- | --- |
| List, view datasets, experiments, reports | WORKFLOW_READ |
| Create / update / delete datasets, add / remove test cases, launch / cancel / delete experiments | WORKFLOW_EDIT |
Permissions are enforced per workflow resource ID — a user with WORKFLOW_EDIT on Workflow A but not Workflow B cannot launch an experiment against B’s nodes.
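The per-workflow check can be pictured with a small sketch. The action names and the grants layout are hypothetical, and whether WORKFLOW_EDIT implies WORKFLOW_READ is not stated here, so the sketch checks only the permission listed for each action.

```python
# Hypothetical action names mapped to the permission from the table above.
REQUIRED_PERMISSION = {
    "view_report": "WORKFLOW_READ",
    "list_datasets": "WORKFLOW_READ",
    "create_dataset": "WORKFLOW_EDIT",
    "add_test_case": "WORKFLOW_EDIT",
    "launch_experiment": "WORKFLOW_EDIT",
}


def is_allowed(grants: dict, workflow_id: str, action: str) -> bool:
    """grants maps a workflow resource ID to the set of permissions the
    user holds on that workflow."""
    return REQUIRED_PERMISSION[action] in grants.get(workflow_id, set())


grants = {
    "workflow-a": {"WORKFLOW_READ", "WORKFLOW_EDIT"},
    "workflow-b": {"WORKFLOW_READ"},
}
print(is_allowed(grants, "workflow-a", "launch_experiment"))
print(is_allowed(grants, "workflow-b", "launch_experiment"))
```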

How it works

The runtime side has three pieces:
  • integration-designer owns the dataset, test-case, and experiment REST APIs and the experiment orchestration. It publishes judge job requests to Kafka and writes results to MongoDB.
  • evals-judge is a small Python service that consumes judge job requests, resolves the judge LLM via organization-manager, calls the LLM provider, and publishes scored responses.
  • organization-manager owns the LLM catalogue and resolves the model bound to the EVAL capability for the org.
Scoring is fully async. A judge job’s lifecycle:
  1. integration-designer publishes a request on ai.flowx.ai-platform.evals-judge.job.request.v1 with the test case, the evaluator id, and the node-config snapshot.
  2. evals-judge consumes the request, resolves the judge model, runs the scoring prompt, and publishes the result on ai.flowx.ai-platform.evals-judge.job.response.v1.
  3. integration-designer consumes the response, persists the score on the test-case row, and pushes a progress event over SSE to any open Designer report.
  4. Permanent failures (no judge model resolvable, invalid request shape, repeated timeouts) end up on ai.flowx.ai-platform.evals-judge.job.dlq.v1 for manual operator review — the report shows the test-case row as ERROR.
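The four lifecycle steps can be simulated with in-memory queues standing in for the Kafka topics. This is a sketch under stated assumptions, not the services' code: the message fields, the stub judge, and the way the report learns about dead-lettered jobs are all hypothetical. Only the topic names come from the text.

```python
from collections import deque

REQUEST = "ai.flowx.ai-platform.evals-judge.job.request.v1"
RESPONSE = "ai.flowx.ai-platform.evals-judge.job.response.v1"
DLQ = "ai.flowx.ai-platform.evals-judge.job.dlq.v1"


def fake_judge(job):
    """Stand-in for resolving the judge model and calling the LLM."""
    if job.get("judgeModel") is None:
        raise LookupError("no judge model resolvable")
    return {"score": 1.0, "reasoning": "stub reasoning"}


def simulate(jobs):
    # Step 1: integration-designer publishes the requests.
    topics = {REQUEST: deque(jobs), RESPONSE: deque(), DLQ: deque()}

    # Step 2: evals-judge consumes each request and publishes a result.
    # Step 4: permanent failures go to the DLQ instead.
    while topics[REQUEST]:
        job = topics[REQUEST].popleft()
        try:
            result = fake_judge(job)
        except LookupError as exc:
            topics[DLQ].append({**job, "error": str(exc)})
        else:
            topics[RESPONSE].append({**job, **result})

    # Step 3: integration-designer persists scores on the test-case rows.
    rows = {}
    while topics[RESPONSE]:
        msg = topics[RESPONSE].popleft()
        rows[(msg["testCaseId"], msg["evaluatorId"])] = msg["score"]
    # Dead-lettered jobs stay on the DLQ and show as ERROR in the report.
    for msg in topics[DLQ]:
        rows[(msg["testCaseId"], msg["evaluatorId"])] = "ERROR"
    return rows, list(topics[DLQ])


rows, dead = simulate([
    {"testCaseId": "tc-1", "evaluatorId": "correctness", "judgeModel": "gpt-5.4-mini"},
    {"testCaseId": "tc-2", "evaluatorId": "correctness", "judgeModel": None},
])
print(rows)
```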

REST API surface

The Designer is the primary entry point, but the same endpoints are usable from your own tooling. All are scoped to a workspace and require WORKFLOW_READ / WORKFLOW_EDIT on the relevant workflow.
| Method | Endpoint | Purpose |
| --- | --- | --- |
| POST | /api/workflows/datasets | Create a dataset on a node |
| GET | /api/workflows/datasets?nodeFlowxUuid={uuid} | List datasets, optionally filtered to one node |
| GET / PUT / DELETE | /api/workflows/datasets/{datasetUuid} | Fetch, update name/description, or delete a dataset |
| POST | /api/workflows/datasets/{datasetUuid}/test-cases | Add a test case (manual) |
| POST | /api/workflows/experiments/start | Launch an experiment |
| GET | /api/workflows/experiments | List experiments (paginated) |
| GET | /api/workflows/experiments/{experimentUuid} | Experiment report with aggregate scores |
| GET | /api/workflows/experiments/{experimentUuid}/test-cases/{testCaseUuid} | One test-case row with all scores |
| POST | /api/workflows/experiments/{experimentUuid}/cancel | Cancel a running experiment |
| DELETE | /api/workflows/experiments/{experimentUuid} | Delete a completed or cancelled experiment |
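For tooling outside the Designer, a request builder along these lines may be a useful starting point. Only the paths and HTTP methods come from the table above. The host, the bearer-token header, the request body field names, and the placeholder UUIDs are assumptions, and workspace scoping is not shown because its mechanism is not described here. The sketch builds requests without sending them.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://flowx.example.com"  # hypothetical host, replace with yours


def build_request(method, path, token, body=None, query=None):
    """Build (but do not send) a request for one of the listed endpoints."""
    url = BASE_URL + path
    if query:
        url += "?" + urlencode(query)
    data = json.dumps(body).encode("utf-8") if body is not None else None
    # Bearer authentication is an assumption about the deployment.
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)


# List the datasets attached to one node.
list_req = build_request(
    "GET", "/api/workflows/datasets", "TOKEN",
    query={"nodeFlowxUuid": "00000000-0000-0000-0000-000000000000"},
)

# Launch an experiment. The body field names are illustrative assumptions.
start_req = build_request(
    "POST", "/api/workflows/experiments/start", "TOKEN",
    body={"datasetUuids": ["<datasetUuid>"], "evaluators": ["correctness"]},
)

print(list_req.get_method(), list_req.full_url)
print(start_req.get_method(), start_req.full_url)
```

Passing either object to `urllib.request.urlopen` would send it; check the real payload schema against your deployment before doing so.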

Operator setup

Evals add a new service and three new Kafka topics to the AI Platform deployment.

evals-judge service

  • Language: Python
  • Container image: {{flowx-ai}}/evals-judge
  • Dependencies: Kafka (request / response / DLQ), organization-manager (LLM resolution), the LLM provider configured for the EVAL capability, and S3-compatible storage (configured through the MinIO env vars; used for multimodal test cases).
  • No exposed ingress. Communication is Kafka-only.
  • Storage: none of its own — state lives in integration-designer’s MongoDB and the LLM provider.
Add evals-judge to your AI Platform install. See AI Platform setup for the full Python services list.

Kafka topics

Declare the following in your Kafka provisioning (kafka-topics.yaml.gotmpl for k8s, or your own topic management):
| Topic | Partitions | Purpose |
| --- | --- | --- |
| ai.flowx.ai-platform.evals-judge.job.request.v1 | 3 | Judge job requests from integration-designer |
| ai.flowx.ai-platform.evals-judge.job.response.v1 | 3 | Scored responses back to integration-designer |
| ai.flowx.ai-platform.evals-judge.job.dlq.v1 | 3 | Dead-letter queue for unscored jobs |
The integration-designer consumer group for the response topic is integration-designer-evals-judge-response-group with 3 consumer threads by default.

EVAL capability backfill

The migration 20260513_eval_capability.xml runs on the first organization-manager boot after upgrade and:
  • Registers the EVAL capability in llm_capability_catalog (display order 7).
  • Backfills gpt-5.4-mini as the model bound to EVAL in llm_model_catalog (platform default).
  • Backfills the same binding into every existing organization’s llm_model table so evals work out-of-the-box on upgrade without manual configuration.
If your org uses a non-OpenAI provider, change the bound model in Settings → AI Providers → Models after the upgrade.

Limitations in 5.9.0

  • No user-defined evaluators. The 10-evaluator catalogue is read-only. Custom rubrics and custom scoring prompts are planned for a later release.
  • No eval suggestions. The EVAL_SUGGESTIONS capability is registered but reserved — there is no producer in 5.9.0.
  • Workflow-scoped only. Datasets bind to a single AI node on a single workflow. Cross-workflow shared datasets are out of scope.
  • Judge model is org-wide. Picking a different judge per evaluator or per experiment is not supported.
  • No retroactive backfill across workflow rewrites. If you change the AI node’s contract significantly, existing datasets keep their snapshot — they don’t fail, but they may no longer reflect the current workflow’s behaviour. Create a new dataset.
  • DLQ is manual. There is no automatic retry for jobs landing on the DLQ; an operator must inspect and decide whether to retry.

AI Platform setup

Install evals-judge alongside the rest of the AI Platform.

AI providers

Configure the LLM bound to the EVAL capability for your org.

Agent Builder overview

Build the AI agents whose outputs you’ll score.

Conversational workflows

Conversational workflow nodes you can attach datasets to.
Last modified on June 2, 2026